May  1982 


Report.  No.  STAN-CS-82-906 


Truncated-Newton  Methods 


Stephen  Gregory  Nash 


QD'ALm  ^ 


Aeprovsifl 


rsk-as?^ 


OtmuiKiujc  Qpjxiaimt^ 


Department  of  Computer  Science 

Stanford  University 
Stanford,  CA  94305 


TRUNCATED-NEWTON  METHODS 


A  DISSERTATION 

SUBMITTED  TO  THE  DEPARTMENT  OF  COMPUTER  SCIENCE 
AND  THE  COMMITTEE  ON  GRADUATE  STUDIES 
OF  STANFORD  UNIVERSITY 
IN  PARTIAL  FULFILLMENT  OF  THE  REQUIREMENTS 
FOR  THE  DEGREE  OF 
DOCTOR  OF  PHILOSOPHY 


by 

Stephen  Gregory  Nash 
May  1982 


TRUNCATED-NEWTON  METHODS 

Stephen  G.  Nash,  Ph.D. 

Stanford  University,  1982 

The  problem  of  minimizing  a  real-valued  function  F  of  n  variables  arises  in  many 
contexts.  Most  methods  for  solving  this  problem  have  their  roots  in  Newton’s  method,  i.e. 
they  are  based  on  approximating  E  by  a  quadratic  function  Q.  If  the  number  of  variables 
n  is  large,  then  Newton’s  method  can  be  problematic  since  it  requires  the  computation 
and  storage  of  the  Hessian  matrix  of  second  derivatives.  Use  of  finite-differencing  and 
sparse-matrix  techniques  has  overcome  some  of  these  problems  but  not  all. 

In  this  thesis,  we  examine  a  flexible  class  of  methods,  called  truncated-Newton 
methods.  They  are  based  on  approximately  minimizing  the  quadratic  function  Q  using 
an  iterative  scheme  such  as  the  linear  conjugate-gradient  algorithm.  A  truncated-Newton 
algorithm  is  made  up  of  two  sub-algorithms:  an  outer  non-linear  algorithm  controlling 
the  entire  minimization,  and  an  inner  linear  algorithm  for  approximately  minimizing  Q. 

The  most  important  choice  is  the  selection  of  the  inner  algorithm.  When  the  Hessian 
matrix  is  known  to  be  positive-definite  everywhere,  then  the  basic  linear  conjugate- 
gradient  algorithm  can  be  used.  If  not,  Q  may  not  have  a  minimum.  We  have  used 
the  correspondence  between  the  linear  conjugate-gradient  algorithm  and  the  Lanezos 
algorithm  for  tridiagonalizing  a  symmetric  matrix  to  develop  methods  for  the  indefinite 
case. 

The  performance  of  the  inner  algorithm  can  be  greatly  improved  through  the  use  of 
preconditioning  strategics.  Preconditionings  can  be  developed  using  either  the  outer  non¬ 
linear  algorithm  or  using  information  computed  during  the  inner  algorithm.  A  number 
of  diagonal  and  tridiagonal  preconditioning  strategies  are  derived  here. 

Numerical  tests  show  that  a  carefully  chosen  truncated-Newton  method  can  per¬ 
form  significantly  better  than  the  best  non-linear  conjugate-gradient  algorithms  available 
today.  This  is  important  since  the  two  classes  of  methods  have  comparable  storage  and 
operation  counts,  and  they  are  the  only  methods  available  for  solving  many  large-scale 
problems. 
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1  Introduction 


1.1.  Motivation 

The  problem  of  minimizing  a  real-valued  function  of  n  variables 

minF(a:)  (l-l-l) 

X 

arises  in  many  contexts  and  applications.  Over  the  years,  a  large  variety  of  methods  have 
been  derived  to  solve  this  problem.  Many  of  these  methods  have  their  roots  in  Newton’s 
method,  i.e.  they  are  based  on  approximating  F  by  a.  quadratic  function  using  first- 
and  second-derivative  information  at  the  current  point.  The  quadratic  function  is  then 
minimized,  and  it  is  hoped  that  this  minimum  gives  information  about  the  minimum  of 
the  original  function. 

Much  work  has  been  done  to  adapt  and  improve  this  basic  method.  In  part  the 
motivation  for  these  changes  is  that  the  basic  method  is  not  always  defined.  For  example, 
if  the  Hessian  matrix  is  indefinite  at  some  iteration,  then  the  quadratic  does  not  have  a 
minimum. 

Variations  in  the  methods  for  problem  (1.1.1)  have  also  been  derived  for  reasons 
based  on  the  nature  of  the  objective  function  F.  There  are  basically  two  difficulties  which 
can  arise.  Firstly,  if  the  number  of  variables  n  is  large,  then  storage  limitations  can  make 
it  difficult  to  store  information  about  the  problem.  Secondly,  the  second  derivatives  of 
the  function  F  may  be  very  expensive  (or  impossible)  to  compute. 

Because  Newton’s  method  in  its  traditional  form  requires  the  computation  and 
storage  of  the  n  X  n  matrix  of  second  derivatives,  it  can  be  problematic  for  both  of 
these  reasons.  Use  of  finite-differencing  and  sparse-matrix  techniques  has  overcome  some 
of  these  problems,  but  not  all. 

The  other  chief  classes  of  methods  are  Quasi-Newton  and  Conjugate-Gradient  algo¬ 
rithms.  Quasi-Newton  methods  do  not  require  any  second-derivative  information;  they 
still  require  the  storage  of  an  n  X  n  matrix.  Conjugate-gradient  methods,  however, 
remove  even  this  difficulty,  since  they  only  require  a  few  n  vectors. 

These  difficulties  are  not  overcome  without  some  cost.  In  terms  of  the  total  number 
of  iterations  (or  function  evaluations)  required  to  solve  a  minimization  problem,  Newton’s 
method  is  extremely  efficient.  Quasi-Newton  methods  can  often  approach  this  efficiency 
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on  small  problems,  but  the  performance  of  conjugate-gradient  methods  is  by  comparison 
erratic. 

These  differences  in  requirements  and  performance  for  the  three  classes  of  methods 
are  unfortunate.  They  imply  that  all  minimization  problems  must  also  be  put  into  one 
of  three  classes,  based  on  which  algorithm  is  best  capable  of  solving  them  on  a  given 
machine.  Unfortunately,  differences  between  problems  are  not  always  very  sharp.  It 
would  be  preferable  if  the  distinctions  between  algorithms  were  not  as  great. 

In  the  last  few  years,  work  has  been  done  to  fill  in  the  gap  between  conjugate- 
gradient  and  quasi-Newton  methods.  This  work  comes  under  the  category  of  limited- 
memory  Quasi-Newton  methods.  More  recently  still,  truncated-Ncwton  methods  have 
been  developed.  In  the  context  in  which  we  will  develop  them,  they  can  be  viewed  as  a 
synthesis  of  all  three  basic  methods. 

The  great  advantage  of  truncated-Newton  methods  is  their  (lexibility.  They  can  be 
adjusted  to  emulate  any  of  the  standard  algorithms  as  well  as  everything  in-between. 
They  have  variable  storage  requirements.  It  is  possible  to  adjust  them  to  use  varying 
amounts  of  second-derivative  information,  ft  is  also  possible  to  fine-tune  these  methods 
to  the  needs  of  the  problem  being  solved.  Potentially,  they  can  also  adapt  to  changes  in 
the  behavior  of  the  function  being  minimized. 

In  addition,  we  are  concerned  with  the  effect  the  computing  environment  has  on 
the  choice  of  an  algorithm.  When  using  a  large,  central  computing  facility  complete 
with  program  libraries  and  technical  consultants,  then  efficiency  and  stability  of  the 
method  are  the  only  considerations.  However,  when  a  small  machine  is  the  primary  device 
available,  then  the  size  and  complexity  of  the  program  must  also  be  taken  into  account. 
This  situation  is  becoming  ever  more  important  as  the  cost  of  small  machines  continues 
to  drop,  and  as  distributed  computing  becomes  a  more  popular  way  of  allocating  machine 
resources. 

The  main  topic  of  this  thesis  is  the  effective  implementation  of  truncated-Newton 
methods.  After  some  nece.ssary  preliminaries,  it  begins  with  a  discussion  of  the  three 
traditional  classes  of  algorithms,  along  with  a  discussion  of  the  techniques  which  arc  used 
to  make  them  useful  for  larger  classes  of  problems.  This  is  followed  by  a  description  of 
truncated-Newton  methods  in  their  most  basic  form,  along  with  a  discussion  of  some  of 
the  underlying  algorithms  that  might  be  used  to  implement  them.  Chapter  4  deals  with 


2 


convergence  criteria  for  tlie  sub-aigorithm,  and  Chapter  5  with  preconditioning,  which 
is  essential  if  these  methods  are  to  be  competitive.  Chapter  6  discusses  extensions  to 
constrained  and  least-squares  problems.  Numerical  results  arc  presented  in  Chapter  7. 
Finally,  an  extensive  discussion  of  how  to  adapt  truncated-Ncwton  methods  for  specific 
problems  (both  through  a  priori  information  and  through  dynamic  modification  of  the 
algorithm)  is  given  in  Chapter  8. 

1.2.  Notation  and  Basic  Theory 

The  principal  problem  we  arc  trying  to  solve  in  this  thesis  is 

min  F(x),  (1-2.1) 

where  F(x)  is  a  nonlinear  real-valued  function  of  the  variables 

X  = 

and  9?”  denotes  the  n-dlrncnsional  Euclidean  space.  The  gradient  of  F  will  be  denoted 
by  the  vector  g  where 

<9^(i)  .  .  „ 

~  dxi  ’  *  ~ 

and  G  will  be  used  to  denote  the  n  X  n  matrix  of  second  dcrivates,  i.e. 

^  _  d^F(x)  i  =  l,2,...,n 

dxidxj  j  =  l,2,  ...,n. 

All  methods  considered  here  for  solving  (1.2.1)  will  be  descent  methods;  that  is, 
the  value  of  the  objective  function  F{x)  will  be  decreased  at  each  iteration.  More 
specifically,  except  in  the  section  which  describes  trust-region  methods,  we  will  principally 
be  concerned  with  line-search  algorithms.  As  a  result,  all  of  our  algorithms  will  have  the 
following  general  form: 

1.2.2.  Descent  Algorithm 

Dl.  Given  x^^\  the  A:*'**  approximation  to  x*,  a  minimum  of  F[x). 

D2.  Compute  a  direction  of  search,  such  that  <  0. 

D3.  Find  >  0,  a  scalar  step-length,  such  that  < 

F’(x(''>). 
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D4.  Set  ■«—  and  return  to  step  D2. 

Step  D3  is  called  the  line-search,  and  it  will  be  discussed  later  in  this  chapter.  For 
our  purposes,  step  D2  will  be  the  most  significant,  for  the  process  used  to  compute 
generally  classifies  the  entire  algorithm.  This  computation  is  often  based  on  the  gradient 
or  the  Hessian  at  the  point  (denoted  by  g^'‘^  and  respectively),  or  on  information 
accumulated  in  previous  iterations. 

A  considerable  amount  of  the  work  in  step  D2  is  dependent  on  results  from  linear 
algebra.  The  principal  theoretical  results  will  be  presented  in  the  next  section,  but  first 
some  notational  details  will  be  discussed  here. 

In  general,  matrices  will  be  denoted  by  upper-case  Roman  letters  (G),  and  their 
elements  will  be  specified  using  subscripts  (G.y).  Vectors  will  be  denoted  by  lower-case 
Roman  letters,  with  subscripts  again  being  used  for  individual  elements  (g,  g^).  Scalars 
will  be  denoted  by  lower-case  Creek  letters  (a).  A  superfix  ^on  a  matrix  or  vector  denotes 
transpose.  ||i/l|  denotes  the  Euclidean  norm  of  the  vector  y.  Other  than  those  vectors 
already  mentioned,  in  the  context  of  optimization  there  are  two  additional  vectors  which 
have  special  meaning.  These  are 

g(fc)  =  x^k+i)  _  ^(k)^ 


the  difference  between  the  successive  estimates  of  the  minimum,  and 


yW  ^  + 


the  difference  between  the  successive  gradient  values. 

In  order  to  be  able  to  terminate  algorithm  (1.2.2),  it  is  important  to  know  how  to 
identify  x* ,  the  point  which  minimizes  the  function  F.  The  following  theorem  gives 
necessary  and  sufficient  conditions  for  the  minimum  of  an  unconstrained  real-valued 
function. 

Theorem  1.2.3  (a)  Let  x*  be  a  relative  minimum  point  of  the  twice  continuously 

differentiable  function  F.  Then  g(x*)  =  0  and  G{x*)  is  positive  semi-definite.  (See 
section  1.3  for  a  definition  of  positive  semi-definite.) 

(b)  Suppose  that  F  is  a  twice  continuously  differentiable  function  mapping  from  S?" 
to  3J.  Suppose  in  addition  that  x*  is  a  point  in  3?"  for  which  g{x*)  =  0  and  G(a:  )  is 
positive  definite.  Then  x*  is  a  strict  relative  minimum  point  of  F. 
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1.3.  Basic  Results  in  Linear  Algebra 

The  information  in  this  section  will  be  presented  briefly  and  without  proof.  A  much 
more  complete  discussion  can  be  found  in  Wilkinson  [1965],  Chapter  1. 

Definitions: 

1.  A  matrix  A  is  symmetric  if  A  =  A^. 

2.  A  symmetric  matrix  A  is  positive  definite  if 

y'^Ay  >0,  Vy  0. 

3.  A  symmetric  matrix  A  is  positive  scmi-dcBnite  if 

y'^Ay  >  0,  Vy. 


[Similar  definitions  exist  for  negative  definite  and  negative  semi-definite 
matrices.  A  matrix  falling  into  none  of  these  categories  is  called  indefinite.] 

4.  A  set  of  vectors  {ai, . . .  ,a„}  is  linearly  independent  if 


^  /?yo,  =  0 


y=i 


implies  that 


5.  The  space  spanned  by  a  set  of  vectors  is  the  space  generated  by  all 
linear  combinations  of  those  vectors. 

6.  The  rank  of  a  matrix  A  is  equal  to  the  maximum  number  of  lincarly- 
independent  rows. 

7.  The  range  of  a  matrix  A,  denoted  by  i?(A),  is  the  space  spanned  by 
the  columns  of  A. 

8.  The  nuU  space  of  a  matrix  A,  denoted  by  N(A),  is  {x  \  A^x  —  O}. 

9.  The  condition  number  of  a  non-singular  matrix  A  is  defined  to  be 


k(A)  =  ||A|| .  ha-mi, 


where  ||•||  is  the  2-norm  of  a  matrix. 

5 


10.  A  matrix  A  is  lower  (upper)  triangular  if 

Aij  =  0,  i  <  j  (t  >  j). 

A  is  unit  lower  (upper)  triangular  if,  iti  addition,  An  =  1,  Vt. 

Results: 

1.  Let  A  be  an  n  by  n  symmetric  matrix.  Then  there  exist  n  orthonormal 
vectors  vi,. ..  ,Vn  and  n  scalars  Xi, . . . ,  X„  such  that 

Avi  =  \iVi,  i  =  1 , . . . ,  n. 

The  vector  Vi  is  an  eigenvector  of  A,  and  X,-  is  its  associated  eigenvalue. 

2.  A  symmetric  matrix  of  rank  r  has  r  non-zero  eigenvalues. 

3.  A  positive-definite  matrix  has  positive  eigenvalues. 

4.  A  symmetric  matrix  A  is  positive  definite  if  and  only  if  it  can  be 
factored  as 

A  =  LDL^, 

where  L  is  unit  lower  triangular  and  D  is  diagonal  with  positive  diagonal 
entries.  [Cho/es/cy  factorization} 

1.4.  Line  Search  Techniques 

As  in  the  previous  section,  this  will  only  be  a  brief  discussion  of  the  topic  of  line 
searches.  More  complete  information  can  be  found  in  Gill  and  Murray  [1979]  and  [1974b]. 
Step  D3  of  algorithm  (1.2.2)  requires  that  a  scalar  a  be  found  such  that 

F(x  +  ap)  <  F{x).  (1-4-1) 

[The  superscript  will  be  dropped  for  reasons  of  clarity.]  One  way  to  achieve  this  is  to 
require  that 

Fix  +  ocp)  =  min  Fix  +  ap).  (1.4.2) 

a>0 

Although  necessary  for  1-dimensional  minimization,  this  condition  is  overly  stringent  in 
a  higher-dimensional  context  (and  of  little  use  for  constrained  optimization). 

At  the  opposite  extreme,  it  is  not  sufficient  to  choose  just  any  value  of  a  such  that 

6 


(1.4. 1)  is  satisfied.  To  see  this,  consider  the  simple  1-diincnsional  problem: 

F{x)  =  x^, 
a;(‘>  =  2, 

pW  =  -1,  Vik, 

=  2-''. 

It  is  easily  seen  that  the  sequence  satisfies  (1.4.1)  but  that 

lim  =  I  0  =  X*. 

fc— ►OO 

In  order  to  elTicicntly  minimize  functions  of  several  variables,  and  also  to  be  able 
to  guarantee  convergence  of  optimization  algorithms,  a  compromise  between  these  two 
positions  has  been  found  to  be  effective.  In  this  regard,  two  concepts  have  been  shown  to 
be  of  considerable  value.  The  first  is  that  the  search  direction  must  not  become  arbitrarily 
close  to  being  orthogonal  to  the  steepest  descent  direction  (that  is,  the  gradient).  Usually 
this  is  achieved  by  the  method  used  to  choose  the  search  direction. 

If  this  property  is  satisfied,  then  the  second  condition  is  that  the  function  F(x) 
must  be  “sufficiently”  reduced  at  each  iteration.  This  condition  is  often  achieved  by  an 
appropriate  choice  of  step  length  within  a  line  search  algorithm.  One  such  algorithm  is 
presented  here. 

(1.4.3)  Line  search  algorithm 

Let  {ccj  }  define  a  sequence  of  points  that  tend  in  the  limit  to  the  minimum 
of  F{x)  along  p.  (If  F{x)  is  smooth,  this  sequence  can  be  computed  by 
means  of  some  safeguarded  polynomial  interpolation  algorithm.)  Let  t  be 
the  index  of  the  first  member  of  this  sequence  such  that 

|g(x  +  atvYp\  <  (l-'l-4) 

where  7/  (0  <  7/  <  1)  is  some  constant  scalar.  Let  /i  (0  <  /i  <  i)  be 
another  constant  scalar.  Find  the  smallest  non-negative  integer  r  such 
that 

F(x)  —  F’(x  +  2“’'atp)  >  — 2“’'at/i(7^p  (1.4.5) 

and  set  a  =  2“’'at.| 

If  a  is  computed  according  to  this  rule,  it  can  be  shown  (Gill  and  Murray  (1973a]) 


7 


that 


F(x)  -  F[x  +  ap)  > 

where  is  a  real-valued  function  such  that,  for  any  sequence  {cjt}, 

lim  <f>{ck)  =  0  implies  lim  cjt  =  0. 

fc“»-oo  k—*oo 

An  important  property  of  conditions  (1.4.4)  and  (1.4.5)  is  that  if  p,  is  chosen  as  a  small 
value  (say  10“^)  then,  unless  F{x)  is  a  pathologically  ill-behaved  function,  any  value  of 
at  satisfying  (1.4.4)  automatically  satisfies  (1.4.5)  with  r  =  0.  In  this  case  the  line  search 
algorithm  reduces  to  finding  a  scalar  a  such  that 

\g{x  -f  ap)'^p\  <  -pg^p. 

The  value  of  p  can  be  specified  by  the  user  and  can  be  used  to  give  a  step  length  that  is 
well-suited  to  the  problem  being  solved.  If  p  is  chosen  as  0.9,  the  algorithm  will  generally 
compute  a  “crude”  value  of  a,  provided  it  satisfies  (1.4.5).  This  value  will  often  be  ao, 
the  initial  guess  for  a.  If  tj  is  chosen  as  zero,  a  will  satisfy  (1.4.2),  the  condition  for  an 
exact  line  search. 

In  order  for  certain  asymptotic  convergence  rates  to  be  attained,  it  is  often  necessary 
that  ultimately  a  =  1  for  VA:  >  K.  For  this  reason,  ao  =  1  is  a  common  feature  of  line- 
search  algorithms,  but  it  is  certainly  not  the  only  possibility.  Davidon  [1959]  suggested 
the  following  choice  for  ao 

ao  = 

whenever  the  quantity  on  the  right  hand  side  is  less  than  or  equal  to  1.  Here,  is  a 

user-specified  estimate  of  the  function  value  at  the  solution.  (It  is  common  for  the  user 
to  have  some  a  priori  information  about  the  function  F.  If  not,  the  choice  ao  ==  1  can 
be  used.)  This  formula  has  been  found  to  be  quite  successful  computationally. 

For  nonlinear  conjugate-gradient  algorithms  (described  in  the  next  chapter),  a  fur¬ 
ther  condition  in  the  line  search  is  required  in  order  to  insure  the  descent  property  of  the 
search  direction  at  the  next  iteration.  Details  of  these  requirements  can  be  found 

in  Gill  and  Murray  [1979]. 

One  final  remark  that  is  relevant  to  the  general  topic  of  this  thesis  concerns  the 
computation  of  the  sequence  {ay}  in  (1.4.4).  In  many  situations,  this  sequence  will  be 
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computed  using  function  and  gradient  values.  However,  when  the  gradient  and  function 
are  expensive  to  compute  relative  to  the  cost  of  computing  the  function  alone  (i.e.  twice 
the  cost),  this  sequence  can  be  computed  using  function  values  only  (see  Gill  and  Murray 
(1974b]). 


1.5.  Rates  of  Convergence 

In  the  chapters  to  follow,  we  will  be  concerned  with  the  speed  at  which  various 
algorithms  converge.  The  following  definitions  will  be  useful  for  our  purposes. 

1.  A  sequence  converging  to  x* ,  is  said  to  converge  with  Q-order 


hm  — TTT - —  <  oo. 

k-^oo  ||a:W  —  a:  II’" 

2.  The  sequence  is  said  to  converge  Q-superlinearly  if 

fe— fOO  j|l{*')  —  *  II 

3.  The  sequence  is  said  to  converge  with  R-order  m  if 


(1.5.1) 


(1.5.2) 


||x(^^  -  x*||  <  A:  =  0,1,... 


(1.5.3) 


where  {^Sfe}  is  a  sequence  that  converges  to  zero  with  Q-order  m. 

The  most  important  instances  of  definition  1  are  m  =  1  (Q-linear  convergence) 
and  m  =  2  (Q-quadratic  convergence).  Because  Q-order  rates  of  convergence  are  more 
important  for  our  purposes,  the  label  Q-  will  often  be  dropped  in  the  discussions  to  follow. 
Only  R-order  rates  will  be  distinguished. 

For  a  more  complete  treatment  of  rates  of  convergence,  see  Ortega  and  Rheinboldt 
[1970],  pp.  281-298. 
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2  The  Basic  Methods 


2.1  Introduction 

Although  we  have  so  far  mentioned  only  three  basic  classes  of  methods  for  solving 
the  unconstrained  minimization  problem  (1.2.1),  within  each  class  there  are  a  great  many 
varieties,  and  choice  of  the  exact  algorithm  to  use  is  not  always  easy.  In  the  absence  of 
other  considerations,  Newton’s  method  is  almost  always  the  method  of  choice,  at  least 
in  its  modern  safeguarded  versions.  Other  methods  can  be  considered  as  compromises  to 
Newton’s  method. 

The  three  classes  of  methods  will  be  discussed  from  that  perspective,  since  it  leads 
naturally  to  the  discussion  of  truncated-Newton  methods  in  the  next  chapter.  Newton’s 
method  will  be  discussed  first;  this  will  be  followed  by  descriptions  of  quasi-Newton  and 
conjugate- gradient  methods.  In  the  last  section,  extensions  and  adaptations  of  these 
algorithms  to  larger  classes  of  problems  are  presented. 

It  should  be  pointed  out  that  all  the  methods  in  this  and  the  next  chapter  will  be 
developed  as  linesearch  algorithms.  Alternative  versions  of  these  methods  employing 
trust-region  schemes  will  be  presented  in  a  later  chapter. 

2.2  Newton’s  Method 

Since  it  is  impossible  to  have  a  direct  method  for  solving  the  unconstrained  mini¬ 
mization  problem  (1.2.1)  in  general  (such  a  direct  method  would  imply  that  there  existed 
a  direct  method  for  finding  roots  of  arbitrary  polynomials),  we  must  rely  on  iterative 
methods.  In  the  case  of  minimization,  these  usually  take  the  form  of  finding  some  ap¬ 
proximation  to  the  objective  function  F,  computing  its  minimum,  and  then  using  this 
point  to  compute  a  better  approximation  to  the  minimum  of  the  original  function. 

For  Newton’s  methods,  this  approximate  function  is  based  on  the  expansion  of  F  in 
a  Taylor  series: 

+  p)  =  F(x(°))  -H  p^VF(x(°))  -b  ip^V2F(xW)p  -f  R3(x(°^p),  (2.2.1) 

where  R3(x^”^,p)  represents  the  higher-order  terms  in  the  series.  From  now  on,  we  will 
denote  the  gradient  of  F  as 

g  =  3(x)  =  VF(i), 
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and  the  Hessian  as 


G  =  G{x)  =  V^F{x). 

To  derive  Newton’s  method,  we  drop  the  remainder  term  in  (2.2.1).  If  we  denote  = 
F(x^^^),  then 

min  F(x)  =  min  +  p) 

*  p 

pa  min  Q(p) 


(2.2.2) 


=  min 
p 


/P(0)  ^  gTp  ^  Xp'^Gp 


We  have  now  approximated  F  by  a  quadratic  function.  Taking  the  derivative  of  Q{p)  in 
(2.2.2)  and  setting  it  equal  to  zero,  we  obtain  the  so-called  Newton  equations  for  the  step 
to  the  minimum  of  the  quadratic  function: 


Gp  =  -g.  (2.2.3) 

This  formula,  and  its  equivalent  version  in  (2.2.2)  will  be  fundamental  in  the  work  to 
follow. 

In  its  basic  form,  Newton’s  method  cannot  be  used,  since  it  is  not  always  defined 
or  meaningful.  For  example,  away  from  the  solution  x* ,  there  is  no  guarantee  that  the 
matrix  G  will  be  positive  definite.  This  means  that  (2.2.3)  may  not  have  a  solution,  or 
that  the  minimum  in  (2.2.2)  may  not  exist.  Also,  we  are  insisting  that  the  search  direction 
p  be  a  descent  direction,  i.e.  p'^g  <  0,  which  may  not  be  true  if  G  is  not  positive  definite. 

Many  authors  have  suggested  ways  of  safe-guarding  Newton’s  method  so  that  the 
search-direction  will  be  appropriately  defined  at  each  iteration.  The  most  successful 
of  these  schemes  involve  replacing  G  by  a  related  positive-definite  matrix  (see  Murray 
[1972]). 

Indefiniteness  is  usually  detected  and  corrected  by  computing  and,  if  necessary,  ad¬ 
justing  some  decomposition  of  the  matrix  G.  Greenstadt  [1967]  proposed  using  a  spectral 
decomposition  and  replacing  negative  eigenvalues  with  their  moduli.  Unfortunately,  this 
is  a  very  expensive  operation — and  frequently  unnecessary  as  G  is  often  positive  definite. 
Indefiniteness  can  also  be  detected  using  a  Cholesky  factorization  (or  one  of  its  variants). 
Details  of  this  work  can  be  found  in  Gill  and  Murray  [1974a],  Bunch  and  Parlett  [1971], 
Dax  and  Kaniel  [1977],  etc. 

It  is  not  enough  to  replace  G  by  any  positive-definite  matrix  G.  If  the  norm  of 
E  =  G  —  G  \s  large,  then  jjp]]  will  be  small,  and  the  algorithm  may  not  converge.  It  is 
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thus  important  to  be  able  to  bound  the  norm  of  the  modification  matrix  E. 

Simply  replacing  G  by  a  positive-definite  matrix  is  not  sulRcient  to  define  a  method. 
Other  problems  can  also  occur  during  a  minimization  process.  For  example,  it  is  not 
always  clear  how  to  obtain  a  descent  direction  at  a  saddle  point,  that  is,  when  =  0 
and  G  is  not  positive  definite.  The  methods  described  in  the  previous  paragraph  can  all 
be  used  to  compute  a  direction  of  negative  curvature  in  this  event.  Such  a  direction  is 
defined  by  the  condition 

p^Gp  <  0. 

An  advantage  of  the  Gill  and  Murray  approach  is  its  simplicity,  and  there  is  no 
evidence  that  it  is  less  efficient  than  alternative  methods.  Since  the  truncated-Newton 
algorithm  in  the  next  chapter  owes  much  to  this  method,  some  details  of  the  algorithm 
will  be  discussed  briefly  here. 

This  algorithm  is  based  on  the  result  that  a  symmetric  matrix  G  is  positive  definite 
if  and  only  if  it  can  be  factored  in  the  form 

G  =  LDL"^,  (2.2.4) 

where  D  is  a  diagonal  matrix  with  positive  diagonal  entries  and  L  is  a  unit  lower  triangular 
matrix.  This  suggests  the  following  technique.  Attempt  to  compute  the  Cholesky 
factorization  (2.2.4)  of  G.  If  at  any  stage  Du  turns  out  to  be  negative  or  zero,  add  some 
positive  quantity  Eu  to  Gu  to  correct  the  problem.  Monitor  the  size  of  the  elements  of 
L,  and  if  they  are  “too  large,”  further  increase  the  size  of  Eu.  Now  continue  with  the 
factorization. 

Exact  formulas  for  this  process  can  be  found  in  the  Gill  and  Murray  paper.  The  end 
result  is  that  we  have  computed  the  Gholesky  factorization  of 

G  +  E=^  LDL^,  (2.2.5) 

where  E  =  diag(E,-,).  It  can  be  shown  that  E  is  identically  zero  whenever  G  is 
“sufficiently  positive-definite.”  Since  it  would  probably  be  necessary  in  any  case  to,  com¬ 
pute  the  factorization  (2.2.4)  to  solve  the  Newton  equations  (2.2.3),  the  factorization 
(2.2.5)  can  be  considered  as  a  natural  by-product  of  the  basic  Newton  method.  Using 
this  factorization,  we  obtain  the  modified-Newton  equations  for  the  step  to  the  minimum 
of  the  modified  quadratic'  function: 

(G  +  E)p  =  -g.  (2.2.6) 
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The  resulting  modificd-Newton  algorithm  can  be  described  as  follows: 

(2.2.6)  Modified-Newton  Algorithm 
Nl.  Given  x. 

N2.  Compute  p  from  (2.2.3),  (2.2.6),  or  related  formulas. 

N3.  Find  a  such  that  F{x  +  ap)  <  F{x). 

N4.  Set  X  X  +  ap  and  return  to  step  N2. 

This  algorithm  has  been  left  deliberately  vague  in  order  to  allow  for  the  many  possibilities 
discussed  above  for  steps  N2  and  N3. 

Under  a  variety  of  conditions,  it  can  be  shown  that  the  above  algorithm  is  globally 
and  locally  quadratically  convergent.  Although  more  general  results  can  be  found  (see, 
for  example,  Ortega  and  Rhcinboldt  [1970],  pp.  421-430),  the  following  theorem  will  be 
adequate  for  our  purposes. 

Theorem  (2.2.7)  Suppose  that  F  :  3?”  — 3i  is  twice  continuously  differentiable  in  an 
open  convex  set  D  and  that  there  is  an  i*  in  D  such  that  g{x*)  =  0  and  G{x*)  is  positive 
definite.  Then  there  is  an  open  set  S  which  contains  x*  such  that  for  any  x^^^  6  S  the 
Newton  iterates  are  well-defined,  remain  in  S  and  converge  to  x* .  Moreover,  there  is  a 
constant  P  such  that 

||j;(fc+l)  _  a;*||  <  ^||a;(fc)  _  x*  f  ,  ik  =  0,  1,  .  .  .  . 

2.3.  Quasi-Newton  Methods 

The  methods  of  this  section  have  also  been  referred  to  in  the  literature  as  secant 
methods,  variable-metric  methods,  etc.  Here,  we  will  refer  to  them  as  quasi-Newton 
methods. 

In  the  previous  section  we  computed  the  search  direction  p^^^  as  the  minimum  of  a 
strictly  convex  quadratic  function  or,  equivalently,  as  the  solution  of  the  system  of  linear 
equations  (2.2.6).  Both  of  these  methods  require  the  Hessian  matrix  G^'‘\  It  might  be 
hoped  that  similarly  successful  methods  might  be  derived  from  minimizing 

Q(p)  =  (2-3.1) 

or  equivalently,  solving 
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(2.3.2) 


where  is  some  suitably  chosen  approximation  to 

In  one  dimension,  there  is  a  well-known  algorithm  of  this  type,  called  the  secant 
method.  In  that  method. 


g(k+l)  _  g(k) 


yW 

aW 


Multiplying  through  by  gives 


(2.3.3) 


Considering  the  methods  derived  from  (2.3.2)  as  extensions  of  the  one-dimensional 
secant  method,  it  is  natural  to  insist  that  (2.3.3)  be  satisfied  in  the  n-dimensional  case  as 
well.  In  this  context,  (2.3.3)  will  be  referred  to  as  the  quasi-Newton  condition  and  will 
be  usd  to  define  this  class  of  methods. 

In  n  dimensions,  (2.3.3)  is  insufficient  to  uniquely  specify  and  further  restric¬ 

tions  are  required  to  insure  that  an  algorithm  of  this  type  is  well-defined.  Although  the 
historical  development  of  these  methods  was  somewhat  haphazard,  it  is  now  known  that 
all  the  major  quasi-Newton  methods  can  be  derived  in  the  following  fashion: 


(2.3.4)  Quasi-Newton  Update 
Ul.  Given 

U2.  Choose  a  set  of  properties  P  which  must  preserve.  (Some 

typical  choices  are  symmetry,  positive-definiteness,  a  particular  sparsity 
pattern,  etc.) 

U3.  Choose  a  matrix  norm  H-Hm* 

U4.  Define  as  the  matrix  which  satisfies  (2.3.3),  preserves  the 

properties  P,  and  which  minimizes 


Various  specific  updates  are  then  obtained  by  specifying  the  set  of  properties  P  and  by 
choo.sing  a  norm  IHlAf-  The  following  few  paragraphs  will  outline  the  principal  updates 
in  use  today.  [For  simplicity  in  the  following  discussion,  we  will  denote  B  =  B^'‘\  B  = 

A  simple  formula,  known  as  Broyden’s  update,  is  obtained  by  letting  P  be  void  and 
by  choosing  IHlAf  to  be  the  Frobenius  norm.  Then 


B  =  B-i- 


(y  -  Bs)s'^ 


(2.3.5) 
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For  solving  systems  of  non-linear  equations,  (2.3.5)  is  quite  satisfactory,  but  for 
function  minimization  it  is  inadequate.  For  example,  because  the  Hessian  matrix  is 
symmetric,  it  would  be  desirable  to  insist  that  the  update  formula  have  the  property  of 
hereditary  symmetry;  that  is,  that  B  be  symmetric  whenever  B  is. 

Two  choices  for  II-Hm  lead  to  the  two  following  symmetric  updates.  If  H-HAf  is  chosen 
as  the  Euclidean  2-norm,  then  the  symmetric  rank-one  formula  is  obtained: 


-5  _  D  .  (y-  -  Bsf 


(2.3.6) 


On  the  other  hand,  if  II'IIa/  is  chosen  to  be  the  Frobenius  norm,  then  the  Powell  Symmetric 
Broyden  (PSB)  update  is  obtained: 


^PSB  =  B  + 


{y  —  Bs)s'^  +  s{y  —  Bs)^  (y  —  Bs)^s 


(2.3.7) 


Another  property  of  considerable  significance  is  hereditary  positive-definiteness.  If 
the  matrix  B  is  guaranteed  to  be  positive  definite  and  bounded,  then  the  search  direction 
is  guaranteed  to  be  a  descent  direction.  If  a  positive-definite  approximation  is  not  chosen, 
then  modification  schemes  similar  to  those  necessary  for  Newton’s  method  would  need 
to  be  invoked.  Since  G  is  positive  semi-definite  at  the  solution,  the  restriction  that  B  be 
positive  definite  will  not  asymptotically  prevent  supcrlinear  convergence. 

The  results  for  positive-definite  updates  are  not  quite  as  simple  as  for  symmetric 
updates.  Again,  choosing  II'Hm  to  be  the  2-norm  or  the  Frobenius  norm  leads  to  the 
following  two  update  formulas.  But  in  this  situation,  the  vectors  y  and  s  are  no  longer 
arbitrary.  For  the  updates  to  retain  positive-definiteness,  it  is  necessary  that  yTs  >  0. 
This  property  can  be  assured  to  hold  by  performing  a  sufficiently  accurate  line  search. 
With  this  in  mind,  then,  the  two  updates  are,  with  the  2-norm  (the  Davidon-Fletcher- 
Powell  (DFP)  update): 


^DFP  =  B  + 


[y  -  Bs)y^+  y[y  —  Bs)"^  {y  -  Bs}^s 


(2.3.8) 


and  with  the  Frobenius  norm  (the  Broyden-Fletcher-Goldfarb-Shanno  (BFGS)  update): 


jBbfgs  —  B  + 


yy"^  {Bs){Bsy 


(2.3.9) 


Computationally,  it  has  been  found  that  (2.3.9)  is  the  most  successful  update  known  for 
general  minimization  (see  Gill,  Murray,  and  Pitfield  [1972]). 


15 


A  quasi-Newton  approximation  can  be  initialized  in  a  number  of  ways.  The  simplest 
choice,  and  the  only  choice  in  the  absence  of  addition  information  about  the  Hessian, 
is  to  set  =  I.  If  it  is  feasible  to  compute  some  second-derivative  information, 
this  can  also  be  used  to  initialize  the  approximation,  as  can  information  from  previous 
minimization  steps,  if  the  problem  being  solved  is  one  of  a  sequence  of  similar  problems. 
If  a  trust-region  strategy  is  being  used  in  combination  with  the  quasi-Newton  method, 
then  the  bounds  on  the  variables  can  be  used  to  derive  an  initial  approximation  to  the 
Hessian  (see  Powell  [1970]). 

To  compute  the  search  direction,  it  is  necessary  to  solve  the  system  of  linear  equa¬ 
tions  (2.3.2).  It  may  appear  that  this  will  require  O(n^)  operations  at  each  iteration,  but 
actually,  it  is  possible  to  reduce  this  to  O(n^)  operations  in  either  of  two  ways. 

The  first  observation  is  that  since  all  of  the  update  formulas  are  of  low  rank,  it  is 
simple  to  apply  the  Sherman-Morisson  formula  and  obtain  low-rank  updates  for  H  = 
5“^  (see  Stewart  [1967]).  Using  the  inverse  update,  the  solution  of  (2.3.2)  would  amount 
to  no  more  than  multiplication  by  the  matrix  H,  an  O(n^)  process.  However,  this  is 
unstable.  • 

The  second  idea,  and  this  is  what  is  used  in  the  best  algorithms  today,  is  to  update 
a  factorization  of  B  rather  than  B  itself.  For  example,  if  a  symmetric  approximation  B 
is  stored  in  the  form 

B  =  LL^, 

then  solution  of  (2.3.2)  involves  only  two  back-substitutions,  again  an  O(n^)  process. 
Details  of  how  various  matrix  factorizations  can  be  updated  elficiently  can  be  found  in 
Gill,  Golub,  Murray  and  Saunders  [1974]. 

Under  fairly  mild  restrictions,  quasi-Newton  methods  can  generally  be  shown  to 
exhibit  global  and  superlinear  convergence.  The  following  theorem  (from  Dennis  and 
More  [1977],  page  82)  is  typical  and  adequate  for  our  purposes. 

Theorem  2.3.1.  Let  F  :  5R"  9?  be  twice  continuously  differentiable  in  an  open 

convex  set  D,  and  assume  that  g[x  )  =  0  and  that  G{x  )  is  positive  definite  for  some  x 
in  D.  Suppose  in  addition  that 

||G(a;)-G(x*)||<7ll>^-x*|l 

for  some  constant  7  and' for  all  x  in  D.  Suppose  that  the  algorithm  (1.2.2)  is  implemented 
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by  choosing 


5(fc)p(^)  =  -gW 

where  is  obtained  using  a  BFGS  or  DFP  update  with  =  1.  Also,  suppose  that 
is  determined  using  the  line  search  formulas  (1.4.4)  and  (1.4.5),  and  that 

f;ii*w-x‘ii<+oo. 

/t=o 

Then  converges  superlinearly  to  i*.| 

2.4.  Nonlinear  Conjugate-gradient  Algorithms 

The  (linear)  conjugate-gradient  algorithm  of  Hestenes  and  Stiefel  [1952]  was 
originally  designed  to  solve  the  system  of  linear  equations 

Ax  =  b  (2.4.1) 

where  A  is  a  positive-definite  square  matrix.  As  indicated  earlier,  this  is  equivalent  to 
minimizing  the  quadratic  form 


Q(x)  =  \x'^Ax  —  x'^b.  (2.4.2) 

For  clarity,  we  will  give  an  outline  of  the  algorithm  here.  A  more  complete  derivation 
from  another  point  of  view  can  be  found  in  section  3.2. 

The  solution  x  of  (2.4.1)  will  be  computed  as  a  linear  combination  of  A-conjugate 
directions.  That  is, 

a:  =  53  (2-4.3) 

% 

where 

pjApj  =  0,  for  if^j. 

Notice  that  the  concept  of  A-conjugacy  is  equivalent  to  the  concept  of  orthogonality  if 
the  inner-product  is  defined  as 

{PUP2)  =  pfAp2. 

This  implies  that  any  set  of  A-conjugate  vectors  will  also  be  linearly  independent. 

If  the  representation  (2.4.3)  for  x  is  substituted  into  (2.4.2),  then  minimizing  Q{x) 


17 


becomes  equivalent  to  solving  a  sequence  of  onc-diincnsional  minimization  problems: 


min  Q[x)  =  min  ttfcPfc) 

=  mjn  a.p.)  “iPj)  “ 


=  min 

a 


in  |l  2  “‘“yPi^Pj  -  S 

=  ^  ^min  {^Oi^pfApi  -  a.pffe}^ 


(due  to  the  yl-conjugacy  of  the  vectors  p,).  Each  term  in  the  final  summation  is  a 
minimization  problem  involving  only  the  coefficient  a,,  so  that  the  original  minimization- 
problem  (2.4.2)  has  been  completely  decoupled  into  a  set  of  trivial  one-dimensional 
minimization  problems. 

The  linear  conjugate-gradient  algorithm  is  quite  simple  to  describe.  Let  xq  be  some 
initial  guess  of  the  solution  to  (2.4.1).  At  each  stage,  compute  the  current  residual  = 
b  —  Ax/f.  If  Tk  —  0,  accept  Xk  as  the  solution  of  (2.4.1)  and  terminate  the  iteration. 
Otherwise,  compute  a  new  A-conjugate  direction  pk  (using  the  current  residual  and  the 
previous  A-conjugate  direction  Pfc-i),  and  minimize  Q{x)  along  the  line  which  starts 
at  Xk  and  moves  in  the  direction  pk  (this  corresponds  to  minimizing  one  term  in  the 
summation  above). 

Although  the  conjugate-gradient  algorithm  is  described  as  an  iterative  method,  it  will 
terminate  after  a  finite  number  of  iterations  in  exact  arithmetic.  If  m  is  the  dimension  of 
the  system  of  equations,  then  there  can  be  at  most  m  A-conjugate  vectors  {pfc}  (because 
A-conjugate  vectors  must  also  be  linearly  independent).  Since  the  solution  x  to  (2.4.1)  is 
expressed  in  terms  of  {pfc}>  the  conjugate-gradient  algorithm  must  converge  in  at  most 
m  iterations. 
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One  computational  form  of  the  linear  conjugate-gradient  algorithm  is  as  follows: 
Given  an  initial  guess  xq. 

For  A:  =  0, 1, . . . 

rk  =  b-  Axk 

Pk  =  rlrk  frl_  fk- 1,  /?0  =  0 

Pk  =  'rk  +  PkPk-x 

ftfc  =  rlrklvlApk 

=  Xk+  OikPk- 

A  useful  property  of  this  algorithm  is  that  the  vectors  {r*}  are  mutually  orthogonal: 

rfry  =  0,  if  i  ^  j, 
and  the  vectors  {pk}  are  mutually  A-conjugate: 

pfApj  —  Q,  if  i^j. 

It  is  a  relatively  simple  problem  to  adapt  the  linear  conjugate-gradient  algorithm 
to  solve  general  nonlinear  problems.  Nonlinear  conjugate-gradient  algorithms  are  based 
on  the  Newton  formulas  (2.2.2)  and  (2.2.3).  In  this  case,  the  underlying  assumption  is 
that  F{x)  is  a  quadratic  function,  i.e.  a  function  with  a  constant  Hessian  matrix.  The 
formulas  for  the  linear  conjugate-gradient  algorithm  are  then  applied  to  the  nonlinear 
function. 

The  nonlinear  conjugate-gradient  algorithm  computes  a  new  search  direction  using 
the  formula 

pik+i)  —  _g{k+i)  p{k)p{k)^  (2.4.4) 

The  determination  of  is  now  a  univariate  minimization  problem.  Various  choices  for 
the  scalar  lead  to  the  various  versions  of  the  algorithm.  The  three  principle  formulas 
(all  equivalent  in  the  case  where  F  really  is  a  quadratic  function)  are: 

1.  Fletcher-Reeves 

=  ||5l‘+‘'|ll/||9<‘>|||  (2.4.5) 

2.  Hestenes-Stiefel 

pW  ^y(k)'^g(k+i)/y[k)'^p{k)  (2.4.6) 
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.  3.  Polak-Ribiere 


DW  =  !,'*>  V+'>/ll9“’lli- 


(2.4.7) 


In  the  non-quadratic  casp,  these  formulas  have  quite  distinct  properties.  For  example, 
(2.4.6)  guarantees  that  p(^+i)  =  0  (a  property  of  the  quadratic  case)  irrespective  of 
the  accuracy  of  the  line  search  or  any  possible  non-quadratic  behavior  of  the  objective 
function.  Also,  Powell  [1977]  has  shown  that  (2.4.5)  may  lead  to  slow  convergence  in  the 
general  nonlinear  case  when  exact  line  searches  are  performed. 

In  the  quadratic  case,  the  conjugate-gradient  algorithm  will  theoretically  converge 
in  n  iterations  and  will  have  generated  n  linearly  independent  search  directions 
For  this  reason,  Fletcher  and  Reeves  [1964]  suggested  abandoning  (2.4.4)  after  a  cycle 
of  n  line  searches,  and  setting  as  the  steepest  descent  direction  This 

strategy  is  known  as  restarting.  Since  then,  Powell  [1977]  and  others  have  suggested 
other  restarting  strategies,  and  these  have  considerably  improved  the  performance  of 
nonlinear  conjugate-gradient  algorithms. 

McCormick  and  Pearson  [1969]  have  shown  that,  for  a  wide  class  of  functions,  the 
restarted  conjugate-gradient  algorithm  is  n-step  superlinearly  convergent,  i.e. 


||x("*=+”)-a:*||  „ 

lim  .  .. - ^17“  =  0. 

k-*oo  ]|x(”*')  —  a:  11 


(2.4.8) 


This  result  depends  critically  on  the  use  of  restarting.  Powell  [1976]  has  shown  that 
algorithms  that  do  not  contain  a  restarting  strategy  almost  always  converge  linearly. 

The  result  (2.4.8)  may  be  somewhat  misleading.  In  general,  a  successful  conjugate- 
gradient  algorithm  should  converge  in  2n  to  3n  iterations,  so  that  asymptotic  behavior 
of  the  sort  described  by  (2.4.8)  would  never  be  observed.  Thus,  for  practical  purposes, 
nonlinear  conjugate-gradient  algorithms  can  be  considered  to  exhibit  a  linear  rate  of 
converge,  and  the  motivation  for  restarting  is  not  to  achieve  a  superlinear  rate  of  con¬ 
vergence. 


2.5.  Adaptations  and  Extensions  of  the  Traditional  Methods 

The  three  main  classes  of  methods  are  each  associated  with  a  particular  class  of 
problems  to  which  they  are  ideally  suited.  The  correspondences  are  as  follows: 

1.  Newton’s  method:  small  and  moderately  sized  problems  where  the 
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Hessian  matrix  is  easy  or  cheap  to  compute. 

2.  Quasi-Newton  methods:  small  and  moderately  sized  problems  where 
the  Hessian  is  difficult  or  expensive  to  compute. 

3.  Conjugate-gradient  methods:  large  problems. 

• 

For  small  problems,  whenever  possible  a  user  would  prefer  to  use  Newton’s  method  over 
quasi-Newton  methods,  and  quasi-Newton  methods  over  conjugate-gradient  methods, 
because  the  relative  efficiencies  of  the  former  methods  are  so  much  better.  It  should 
be  remembered,  however,  that  for  particular  problems  we  cannot  be  assured  that  a 
given  method  works  better  than  another.  There  are  reasons  to  suppose  that  the  relative 
efficiencies  of  the  three  approaches  are  different  for  large  problems,  when  it  is  possible 
to  still  apply  all  three  methods  (see  Thapa  [1980],  Gill  and  Murray  [1979]). 

Much  work  has  been  done  to  extend  Newton  and  quasi-Newton  methods  to  larger 
classes  of  problems,  and  to  modify  conjugate-gradient  methods  to  give  them  some  of  the 
properties  of  the  other  two  classes  of  methods. 

Newton’s  method  is  relatively  easy  to  extend  to  large  problems.  At  each  iteration, 
it  is  necessary  to  solve  a  system  of  linear  equations  and,  provided  the  Hessian  is  sparse, 
this  can  be  achieved  using  sparse  matrix  methods  (see,  for  example.  Bunch  and  Rose 
[1976]).  For  most  problems,  however,  second  derivative  information  will  not  be  available. 
By  using  a  finite-difference  approximation  to  G,  Newton’s  method  can  be  extended  to 
the  case  where  only  first  derivatives  are  known.  In  the  dense  case,  this  implies  at  least 
n  additional  gradients  would  be  required.  For  large  problems  or  even  moderate-size 
problems,  this  is  a  prohibitive  cost.  Fortunately,  if  the  Hessian  is  sparse  it  can  be 
approximated  in  considerably  less  than  n  gradient  evaluations.  Details  of  this  work  can 
be  found  in  Thapa  [1980],  Powell  and  Toint  [1979]. 

A  second  possibility  is  to  hold  G  fixed  for  several  iterations.  This  idea  has  been 
explored  by  Brent  [1973],  but  has  been  found  to  be  less  effective  than  using  quasi-Newton 
methods  (see,  for  example,  Broyden  [1971]).  For  special  problems  this  method  can  work 
well.  When  the  cost  of  factoring  the  Hessian  matrix  dominates  the  cost  of  computing  it, 
or  when  the  Hessian  matrix  is  nearly  constant,  this  can  be  the  method  of  choice. 

The  extension  of  quasi-Newton  methods  to  large  problems  is  much  more  complex. 
The  updates  given  in  section  2.3  will  not  in  general  give  a  sparse  Hessian  approximation 
even  if  the  actual  Hessian  is  sparse.  To  overcome  this  deficiency,  sparse  updates  have 
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been  developed.  This  involves  adding  a  sparsity  condition  in  step  U2  of  the  update 
algorithm  (2.3.4).  Schubert  [1970]  developed  a  sparse  update  for  nonlinear  systems  of 
equations.  A  useful  update  for  optimization  was  derived  independently  by  Marwil  [1978] 
and  Toint[l978].  With  these  updates  available,  sparse  matrix  techniques  are  then  used 
to  solve  the  system  of  equations  (2.3.2)  for  the  search  direction  p. 

Currently,  it  appears  that  sparse  quasi-Newton  methods  are  chiefly  of  theoretical 
interest.  First  of  all,  the  sparse  updates  are  no  longer  of  low  rank  (in  fact,  they  are  in 
general  of  rank  n)  so  that  the  converse  of  the  dense  case  is  now  true  and  quasi-Newton 
methods  take  more  algebraic  operations  per  iteration  than  sparse  Newton  methods. 
Secondly,  although  they  can  be  shown  to  converge  theoretically  at  a  superlinear  rate, 
the  asymptotic  regime  is  in  practice  slow  to  set  in,  and  these  methods  are  usually  less 
efficient  than  sparse-Newton  methods  (see  Thapa  [1980]). 

Steihaug  [1980]  has  derived  a  class  of  sparse  approximate  quasi-Newton  updates, 
and  has  shown  that  algorithms  based  on  them  can  be  made  to  converge  globally  and 
superlinearly.  This  class  is  obtained  by  computing  a  sparse  quasi-Newton  update  itera¬ 
tively  using  the  linear  conjugate-gradient  algorithm.  No  practical  experience  has  yet 
been  reported  on  these  methods. 

Although  both  Newton  and  quasi-Newton  methods  will  continue  to  be  improved, 
they  critically  depend  on  the  assumption  that  the  Hessian  matrix  is  sparse.  For  uncon¬ 
strained  problems,  this  is  almost  always  true.  For  constrained  problems,  however,  the 
equivalent  equations  that  require  solution  involve  the  projection  of  the  Hessian  matrix. 
Although  the  Hessian  matrix  and  the  matrices  defining  the  projection  may  be  sparse,  the 
projected  Hessian  is  often  dense.  Except  in  special  cases,  for  such  problems  the  direct 
application  of  Newton  or  quasi- Newton  methods  (as  we  have  described  them  so  far)  is 
unlikely  to  be  successful. 

Because  conjugate-gradient  methods  are  so  frugal  in  their  storage  requirements, 
attempts  have  been  made  to  modify  them  to  make  them  behave  more  like  quasi-Newton 
methods.  This  group  of  methods  is  generally  referred  to  as  Umited-memory  quasi-Newton 
methods.  Since  the  standard  quasi-Newton  updates  described  in  section  2.3  only  involve 
a  few  low-rank  updates  to  an  initial  matrix,  it  is  possible  to  apply  them  to  a  vector 
and  only  store  the  few  vectors  needed  to  define  the  low-rank  portion  of  the  update.  The 
storage  available  determines  the  number  of  updates  used.  Because  the  size  of  the  problem 
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prevents  the  solution  of  the  system  of  equations  (2.3.2),  inverse  updates  are  generally 
used  in  this  application.  Details  of  these  algorithms,  as  well  as  a  great  many  numerical 
examples  illustrating  their  performance,  can  be  found  in  Gill  and  Murray  [1979]. 

Another  technique  which  has  been  used  to  improve  the  performance  of  nonlinear 
conjugate-gradient  algorithms  is  preconditioning.  These  algorithms  will  converge  in  one 
iteration  if  f{x)  is  a  quadratic  function  with  Hessian  matrix  equal  to  the  identity.  The 
idea  behind  preconditioning  is  to  use  information  about  the  problem  (obtained  either  a 
priori,  or  dynamically  as  the  problem  is  being  solved),  to  modify  the  original  problem 
at  each  iteration  so  that  it  behaves  more  like  this  model  problem,  and  hence  is  easier  to 
solve.  Because  preconditioning  is  an  idea  of  such  general  usefulness,  it  will  be  discussed 
in  considerably  more  detail  in  Chapter  5. 

Efforts  have  also  been  inade  to  extend  conjugate-gradient  methods  towards  Newton’s 
method.  This  work  comes  under  the  category  of  truncated-Newton  methods,  and  is  the 
main  subject  of  this  thesis.  The  basic  theory  behind  these  algorithms  will  be  outlined  in 
the  next  two  chapters. 
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3  Truncated-Newton  Methods 


3.1.  Introduction 

In  the  last  chapter,  we  discussed  the  three  traditional  classes  of  methods  for  solving 
the  basic  unconstrained  minimization  problem  (1.2.1).  All  the  methods,  as  we  saw,  were 
based  on  the  solution  of  the  Newton  equations  (2.2.3).  Truncated-Newton  methods  are 
no  exception.  In  a  sense,  they  are  the  converse  to  quasi-Newton  methods.  Quasi-Newton 
methods  compute  a  search  direction  by  exactly  solving  an  approximation  to  the  Newton 
equations,  whereas  truncated-Newton  methods  do  so  by  approximately  solving  the  exact 
Newton  equations. 

There  are  certain  advantages  to  this  approach.  First  of  all,  we  are  always  dealing 
with  exact  second-derivative  information;  in  other  words,  the  sub-problem  we  are  solving 
is  more  closely  related  to  the  actual  problem  we  are  interested  in.  Secondly,  since  we 
only  need  an  approximate  solution  to  (2.2.3),  we  can  use  an  iterative  method  to  solve 
it.  Iterative  methods  generally  have  very  low  storage  requirements,  and  do  not  explicitly 
require  the  Hessian  matrix. 

In  the  next  section,  we  will  briefly  describe  these  methods,  mention  some  of  the 
iterative  methods  that  have  been  proposed  for  solving  (2.2.3),  and  indicate  some  of  the 
problems  that  can  arise.  In  the  final  sections,  the  linear  conjugate-gradient  and  related 
algorithms  will  be  derived  via  the  Lanezos  algorithm,  and  it  will  be  shown  how  to  use 
these  algorithms  to  overcome  the  above-mentioned  problems.  Further  aspects  of  this 
class  of  algorithms  will  be  left  to  later  chapters. 

3.2.  Basic  Description  of  the  Method 

Because  Newton’s  method  is  based  on  a  Taylor  series  expansion  near  the  solution 
of  the  minimization  problem  (1.2.1),  there  is  no  guarantee  that  the  search  direction  it 
computes  will  be  as  crucial  far  away  from  x  .  In  fact,  at  the  beginning  of  the  solution 
process,  a  reasonable  approximation  to  the  Newton  direction  may  be  almost  as  effective 
as  the  Newton  direction  itself.  It  is  only  gradually,  as  the  solution  is  approached,  that 
the  Newton  direction  takes  on  more  and  more  meaning. 

This  suggests  using  an  iterative  method  to  solve  the  Newton  equations.  Moreover, 
it  should  be  an  iterative  method  with  a  variable  tolerance,  so  that  far  away  from  the 


24 


solution,  (2.2.3)  is  not  solved  to  undue  accuracy.  Only  when  the  solution  is  approached 
should  we  consider  expending  enough  effort  to  compute  something  like  the  exact  Newton 
direction.  As  we  approach  the  solution,  the  systems  of  equations  we  are  required  to  solve 
become  progressively  more  similar.  Consequently  it  is  possible  that  a  closer  approxima¬ 
tion  to  the  exact  solution  can  be  determined  with  no  increase  in  effort  by  utilizing  past 
information. 

Sherman  [1978]  suggested  using  Successive-Over-Relaxation  (SOR).  This  is  the 
simplest  of  a  whole  class  of  methods  that  have  been  found  to  be  effective  for  solving 
linear  systems  arising  in  partial  differential  equations.  However,  it  is  difficult  to  get  SOR 
methods  to  perform  well  on  general  problems.  Also,  they  appear  to  be  prohibitively 
expensive  to  use  in  the  context  of  truncated-Newton  methods.  The  number  of  linear 
sub-iterations  required  to  achieve  superlinear  convergence  increases  exponentially  at  each 
non-linear  iteration. 

Much  better  suited  for  this  application  is  the  linear  conjugate-gradient  algorithm. 
Although  it  is  ideal  for  problems  where  the  coefficient  matrix  has  only  a  few  distinct 
eigenvalues,  it  is  guaranteed  to  converge  (in  exact  arithmetic)  in  at  most  n  iterations 
for  any  matrix.  Thus,  the  type  of  exponential  growth  mentioned  above  for  SOR-type 
methods  is  impossible,  at  least  theoretically.  Also,  it  can  be  shown  that  the  linear 
conjugate-gradient  algorithm  is  optimal- in  a  sense  to  be  defined  later. 

A  requirement  of  both  of  these  methods  is  that  the  coefficient  matrix  must  be 
positive-definite.  As  remarked  earlier,  the  Hessian  matrix  is  only  guaranteed  to  be  positive 
semi-definite  at  the  solution  and  may  be  indefinite  elsewhere.  Thus,  whatever  iterative 
method  chosen  to  solve  (2.2.3),  it  must  be  able  to  detect  and  cope  with  indefinite  systems. 
This  is  very  closely  related  to  the  situation  with  Newton’s  method,  but  in  this  case,  we 
are  not  planning  to  perform  a  Cholesky  factorization  of  the  Hessian  matrix,  making  it 
difficult  to  modify  it  directly.  The  next  two  sections  will  describe  how  to  circumvent  this 
problem  in  the  case  of  the  linear  conjugate-gradient  algorithm.  Because  the  SOR-based 
methods  are  prohibitively  expensive  to  use,  even  in  ideal  circumstances,  they  will  not  be 
considered  further. 

Paige  and  Saunders  [1975]  have  developed  two  conjugatc-gradient-like  algorithms  for 
dealing  with  symmetric  indefinite  systems  of  equations.  The  first  of  these,  SYMMLQ,  is 
identical  to  the  traditional  conjugate-gradient  method  in  the  positive-definite  case,  and 
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is  not  of  much  interest  in  this  context.  The  second,  MINRES,  is  based  on  minimizing 
the  norm  of  the  residual  at  each  iteration.  It  produces  different  iterates  than  the  CG 
method,  and  has  many  properties  of  value  to  us  here.  It  will  be  discussed  in  section  3.7. 

Finally,  we  give  here  a  description  of  a  truncated-Newton  method  in  algorithmic 
form.  The  details  of  the  methods  used  to  iteratively  solve  the  Newton  equations  and  to 
precondition  the  algorithm  will  be  given  later. 

(3.2.1)  Truncated-Newton  Algorithm 
TNI.  Given  some  approximation  to  i*. 

TN2.  If  is  a  sufficiently  accurate  approximation  to  the  minimum  of 
F,  terminate  the  algorithm. 

TN3.  Approximately  solve  the  Newton  equations  (2.2.3)  using  some 
iterative  algorithm  with  preconditioning  is  chosen  with  infor¬ 

mation  from  some  non-linear  algorithm  or  from  previous  iterations  (see 
Chapter  5). 

TN4.  Using  the  search  direction  computed  in  step  TN3,  use  the  line 
search  algorithm  (1.4.3)  to  compute  a  new  point  Go  to  step  TN2. 

3.3  The  Linear  Conjugate-Gradient  Algorithm 

A  well  known  technique  for  the  solution  of  large  systems  of  linear  equations  is 
the  linear  conjugate-gradient  method  of  Hestencs  and  Stiefel  [1952].  This  method  can 
be  directly  applied  to  the  Newton  equations  (2.2.3).  The  linear  conjugate-gradient 
algorithm  is  particularly  appropriate  when  matrix-vector  products  of  the  form  can 

be  computed  even  though  the  matrix  or  its  factorization  cannot.  The  conjugate- 
gradient  algorithm  is  usually  derived  as  a  direct  method,  in  the  sense  that,  theoretically, 
the  solution  is  found  after  n  iterations  or  less.  However,  in  practice  the  algorithm  behaves 
more  like  an  iterative  method  since  it  computes  a  sequence  of  improving  esUinatcs  and 
has  the  potential  of  converging  in  much  more  than  n  iterations.  The  finite  termination 
properties  of  the  conjugate-gradient  algorithm  are  based  on  orthogonality  relations  which 
are  not  valid  in  finite  precision  arithmetic.  It  is  possible  to  perform  extra  computations 
in  order  to  recover  finite  termination,  but  this  is  expensive  both  in  terms  of  storage  and 
in  terms  of  operation  counts.  For  large  problems,  this  is  impractical.  Recent  work  of 
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Parlett  [1980]  and  others  has  attempted  to  overcome  this  difficulty  through  the  use  of 
selective  reorthogonalization. 

Although  most  of  the  literature  on  this  algorithm  refers  to  the  solution  of  the 
equation  Ax  —  b,  it  is  felt  that  using  this  notation  here  would  lead  to  unnecessary 
confusion.  In  order  to  be  consistent  with  the  other  parts  of  this  thesis,  we  shall  solve  the 
system  of  linear  equations 

Gp  =  b,  (3.3.1) 

where  G  is  an  n  X  n  positive-definite  matrix.  We  shall  use  {p,}  to  denote  members  of 
an  iterative  sequence  intended  to  solve  (3.3.1).  It  will  be  assumed  that  all  the  operations 
are  performed  in  exact  arithmetic. 

The  conjugate-gradient  algorithm  can  be  derived  by  finding  iterates  that  minimize 
the  quadratic  function  Q(p)  =  ^p'^Gp  —  b'^p. 

Let  pg  be  the  g-th  approximation  to  the  minimum  of  Q(x)  and  let  v, 

be  q  linearly  independent  vectors  that  span  a  subspace  "Vg.  The  minimum  of  Q{x)  may 
be  computed  by  minimizing  Q{x)  oyer  an  expanding  sequence  of  linear  manifolds  that 
eventually  contains  3?”.  If  V,  denotes  the  matrix  with  columns  vi,V2,  > . .  ,Vg  then  the 
minimum  of  Q{x)  over  the  manifold  p,  -t-  T,  is  given  by  the  solution  of  the  problem 


min  Q{pg  +  Vgw). 

If  Pg  +  VgW  is  substituted  into  the  quadratic  function  we  find  that  the  optimal  w  minimizes 


the  function 


iw’^'^GVgW  +  w’^V'^Tg, 


where  r,  =  VQ(pg)  =  Gp,  —  b.  This  quadratic  function  has  a  minimum  at  the  point 
~{V'^GVg)~^V'^rg  and  consequently,  the  required  minimum  over  the  subspace  is  given 

P,+  i=P,-n(V[GV,)-iy^r,. 

Note  that  r,+i,  the  gradient  of  Q{p]  at  p,+i,  is  orthogonal  to  the  columns  of  Vg  since 

V^r,+i  =  -  i) 

=  -V^OV,(V^aV,)-'Y-;T,  +  Dfr, 


The  definition  of  p,-)-i  as  a  minimum  over  the  manifold  p,  -f  "V,  has  special  significance 
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if  each  previous  iterate  pj  is  obtained  as  the  minimum  over  Pj—\  +  Vy_i.  In  this  case 
r,  will  be  orthogonal  to  all  the  columns  of  V,  except  the  last,  and  the  minimum  on  the 
subspace  “V,  is  given  by 

P,+i  =  P,-V,{V;GV,)-'V^r, 

=  p,  +  ^v,^vPav,)-'€,,  (3.3.2) 

where  7  =  —‘rjvg  and  c,  is  the  g-th  column  of  the  identity  matrix. 

Suppose  that  the  columns  of  V,  are  defined  by  the  Lanezos  recurrence  relations 
(Lanezos  [1950]).  In  this  case  we  start  with  some  vector  Vj,  vjv^  —  1,  and  form 

Pj+iVj+i  =  Cvj  -  ajVj  -  pjVj-u  otj  =  VjGvj,  (3.3.3) 

where  Vq  =  0,  and  [Pj'+i  >  0)  is  chosen  so  that  ||t;y+i||2  =  1.  After  the  q-ih  step 

GV,  =  (3.3.4) 

where 

/  Oil  p2 

/?2  «2  Pz 

Pz  «3  • 

r,=  •  .  • 

V  Pq  Olg 

=  and  F^v,+i=0. 

The  process  will  be  terminated  at  the  first  zero  Pj,  so  that  in  general  we  assume  that  Pj 
is  nonzero  for  j  =  1, 2, . . . ,  g.  In  this  case  V^GVq  =  and  (3.3.2)  becomes 

Pq+i  =Pq  +  7V,T-‘e, 

=  Pq  +  lVg{L^)-^D-^L-^e,^, 

where  L,  and  D,  are  the  Cholesky  factors  of  T,.  Since  L,  has  unit  diagonal  elements, 
=  e^.  Consequently, 

Pi+i = p, + JW'fr'v  p-3.5) 

=  Pg  +  OgUg, 
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where  =  ~^q'*’qldq  and  Uq  is  given  by  the  g-th  column  of  the  matrix 

Uq  =  Vq{L-;)-K 

Paige  and  Saunders  [1975]  show  that  the  columns  of  Uq  can  be  computed  from  the 
recurrence  relations 

=  Vl,  Ug  =  Vq-  IqUq-l, 

where  is  the  (g  —  l)-th  sub-diagonal  element  of  the  lower  bi-diagdnal  matrix  L,. 

When  the  vector  Vi  used  to  start  the  Lanczos  process  is  chosen  as  a  multiple  of  the 
right-hand-side  vector  h  and  when  xi  is  zero,  this  algorithm  is  mathematically  equivalent 
to  the  Hestenes-Stiefel  conjugate-gradient  algorithm.  The  derivation  here  emphasizes  the 
fact  that  a  whole  class  of  conjugate-direction  methods  can  be  generated  from  different 
choices  for  vi. 

It  is  well  known  that  in  general,  rounding  error  seriously  impairs  the  performance  of 
Lanczos  tri-diagonalization  by  causing  a  loss  of  orthogonality  in  the  vectors  {vj}.  This 
implies  that  the  matrices  VjGVq  will  no  longer  be  tri-diagonal  and  the  solution  Pg+i. 
over  the  subspace  V,  will  be  correspondingly  inaccurate.  The  effects  of  this  error  are 
noticably  reduced  if  the  starting  vector  vi  is  taken  to  be  a  multiple  of  b.  The  reason  for 
this  is  that,  in  the  Paige-Saunders  algorithm,  each  p,  is  algebraically  of  the  form  V,p, 
where  Tqy  —  Pici.  From  (3.3.4)  we  have 

GVqV  =  V^T^y  + 

to  working  precision,  and  consequently, 

Gpq  =  PiVqCi  +/3,+iV,+ie^p 

=  6-1-  /3,+iV,+ie^p.  (3.3.6) 

This  expression  does  not  depend  at  all  upon  the  orthogonality  of  the  Lanczos  vectors 
and  indicates  that  p,  will  be  the  solution  of  a  problem  with  right-hand  side  that  differs 
from  the  true  6  by  Pg^iVg_^_^e^y,  a  quantity  that  will  ultimately  be  sufTiciently  small. 
Unfortunately,  a  relationship  analogous  to  (3.3.6)  does  not  hold  for  arbitrary  vi. 

In  the  light  of  these  remarks,  we  shall  always  use  the  term  Lanczos  iteration  to  refer 
to  the  recurrence  relations  (3.3.3)  with  vi  defined  as  a  multiple  of  6.  Although  this  is 
mathematically  equivalent  to  the  linear  conjugate-gradient  algorithm  of  Hestenes  and 
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Steifel,  this  derivation  allows  us  to  modify  the  algorithm  in  the  case  when  the  Hessian 
matrix  G  is  not  positive  definite. 

3.4.  Indefinite  Systems 

When  the  matrix  G  is  indefinite,  Paige  and  Saunders  note  that  the  conjugate- 
gradient  method  is  unstable  and  propose  a  modified  Lanczos  method  based  on  the  LQ 
factorization  of  T,  rather  than  the  Cholesky  factorization.  In  the  context  of  minimization, 
however,  even  the  exact  solution  of  an  indefinite  system  is  of  little  practical  value  since 
the  resulting  direction  of  search  is  likely  to  be  a  non-descent  direction.  As  in  the  case  of 
modified-Newton  methods  for  the  factorization  of  G^^\  we  can  make  better  use  of  the 
solution  of  a  neighboring  positive-definite  system. 

The  proposed  method  is  based  on  the  following  theorem.  We  shall  assume  that 
the  Lanczos  iteration,  when  applied  with  exact  arithmetic  to  a  symmetric  matrix  G, 
terminates  at  iteration  s  (s  <  n),  i.e.,  fia+i  vanishes.  As  in  the  last  section,  we  shall  use 
Tg  and  V,  to  denote  the  q  X  q  and  n  X  q  matrices  associated  with  the  q-th  stage  of  the 
Lanczos  iteration. 

Theorem  (3.4.1)  Let  =  diag(en, €22,- ••,««»)  be  a  diagonal  matrix  with  non¬ 
negative  entries  and  let  Eg  denote  the  q  X  q  principal  sub-matrix  of  E^.  If  the  matrix 
Tg  +  Eg  is  positive  definite  with  Cholesky  factors  L,  and  Dg,  the  iteration 

Pi  =  0>  P^+i  =  Pg  +  ^^q> 

where  cr,  =  —v^rjd^,  solves  the  linear  system. 

Proof  Let  G  denote  the  matrix  G  -k  VgEaV^.  Using  the  orthogonality  of  the  Lanczos 
vectors 

V'^GVg  =  V'^GVg  +  E, 

=  Tq+Eg 

=  (3.4.2) 

We  can  apply  the  ideas  of  Section  3.3  to  generate  the  sequence  {p^}  defined  by  the 
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recurrence  relations 


(3-4.3) 

that  will  solve  the  linear  system  Gp  —  b. 

Substituting  the  expression  (3.4.2)  for  V^GVg  in  (3.3.5)  and  using  a  similar  analysis 
to  that  used  to  obtain  (7)  we  find 

Pq+l  P?  4*  ^qi 

where  =  —v^r^jdq.  The  scalar  is  computed  from  r,  using 

=:<(Gp,-6)  +  t,[V,ii;,y^p, 

=  <r,  +  e„u[p,. 

Since  pis  in  the  span  of  and  V,  is  an  orthogonal  matrix,  the  second  term 

on  the  right-hand  side  vanishes.  This  proves  the  theorem.  | 

Corollary  (3.4.4)  The  vector  p  obtained  from  the  recurrence  relations  defined  in 
Theorem  1  satisfies  the  positive-definite  system 

(G  -I-  VEV'^)p  =  b, 

where  \\VEV'^\\  =  ||E||. 

Proof  If  s  is  equal  to  n  the  corollary  follows  trivially.  If  s  <  n  then  V  will  be 
the  matrix  [¥,  V),  where  V  is  the  orthogonal  complement  of  V,.  The  remaining  n  —  s 
diagonal  elements  of  E  are  arbitrary  and  may  be  chosen  so  that  A  V EV^  is  positive 
definite.  | 

The  modified  Lanezos  method  may  be  applied  directly  to  the  Newton  equations 
(2.2.3).  When  exact  arithmetic  is  used  and  the  Lanezos  iteration  is  continued  until  /3,+i 
is  zero  (i.e.,  q  steps  are  computed),  the  nonlinear  algorithm  is  a  modified  Newton  method 
with  a  positive-definite  approximate  Hessian 

gc*)  + 

where  is  the  matrix  VgE^V"^.  Note  that,  unlike  the  modification  produced  by  the 
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direct  application  of  the  modified  Cholesky  factorization,  is  not  a  diagonal  matrix. 
The  orthogonality  of  the  Lanczos  vectors  implies  that 

\\v,E^v-;\\  =  ii^,ii. 

Consequently,  when  the  diagonal  modification  to  T,  is  small,  the  modification  to 
will  be  small  also. 

The  truncated  Newton  method  of  Dembo  and  Steihaug  [1980]  “solves”  (2.2.3)  by 
performing  a  limited  number  of  iterations  of  the  linear  conjugate-gradient  method. 
The  iterations  are  terminated  (“truncated”)  before  the  system  is  solved  exactly.  The 
final  iterate  of  the  truncated  sequence  is  then  taken  as  an  approximate  solution  of 
(2.2.3).  If  a  single  linear  iteration  is  used,  will  be  the  steepest-descent  direction 
— if  the  sequence  is  not  truncated,  will  be  the  solution  of  (2.2.3).  Thus,  the 
algorithm  computes  a  vector  that  interpolates  between  the  steepest-descent  direction  and 
the  Newton  direction. 

Dembo  and  Steihaug  showed  that,  if  G^*'^  is  positive  definite  and  the  initial  iterate 
of  the  linear  conjugate-gradient  scheme  is  the  steepest-descent  direction  all  suc¬ 

ceeding  linear  iterates  will  be  directions  of  descent  with  respect  to  F[x), 

When  G^'')  is  not  guaranteed  to  be  positive  definite,  p^*)  may  not  be  a  descent 
direction.  We  propose  that  the  modified  Lanczos  algorithm  be  used  to  compute  the 
direction  of  search.  The  following  theorem  indicates  that  the  direction  of  search  obtained 
by  terminating  the  modified  Lanczos  scheme  will  always  be  a  direction  of  descent, 
irrespective  of  the  definiteness  of  G^^^. 

Theorem  (3.4.5)  Let  {p,}  denote  the  sequence  of  iterates  computed  by  the  modified 
Lanczos  algorithm  with  po  equal  to  the  zero  vector,  and  assume  that  7^  0.  Then 
<  0  for  all  9  >  0. 

Proof  When  po  is  zero,  p,  (9  >  0)  is  the  solution  of  the  minimization  problem 

min  ip^(G^*'^  -f-  G^*'^)p  +  p 

pev,-! 

where  is  some  modification  to  G^*'^  chosen  so  that  G^*^  -f  Z?^^^  is  positive  definite. 
Thus  * 
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3.5.  Computing  a  Modified  Factorizaton  of  a  Tridiagonal  Matrix 

At  the  first  stage  of  the  Lanczos  process  we  need  to  find  the  Cholesky  factorization 
of  the  1  X  1  “matrix”  given  by 

Tt  =  (ai). 

If  A  is  not  positive  definite,  aj  may  be  negative  or  zero,  and  it  is  replaced  by 

ai  —  max  {oi, 


where  ^  is  a  pre-assigned  small  positive  constant.  This  is  an  allowable  “diagonal” 
modification  to  Ti  since 

ai~ai+pi, 

where  pi  =  max  {0,  ^  —  ai}.  The  introduction  of  the  constant  6  is  necessary  to  bound 
the  factorization  away  from  singularity.  Usually  6  will  be  a  multiple  of  the  relative 
machine  precision. 

At  the  second  stage  of  the  algorithm  we  need  the  modified  factorization  of  the  matrix 


Let  the  Cholesky  factors  of  this  matrix  be  written  as 


(3.5.1) 


/ 1  0  Y  0  Y 1  h\ 

V  h  1  A  0  d2  A  0  1  J 


Straightforward  application  of  the  Cholesky  factorization  to  (3.5.1)  gives  di  =  on,  t/2  = 
max{(5,^},  and  I2  =  where  <j>  =  ^  negative,  the  matrix  is 

indefinite  and  a  diagonal  correction  must  be  made  in  order  to  ensure  sufficient  positive 
definiteness.  The  smallest  modification  to  the  (2,  2)  element  of  (3.5.1)  that  maintains 
positive  definiteness  results  from  redefining  ^2  as  6.  This  modification  adds  the  quantity 
— ^  ^  to  the  second  diagonal  element  of  (3.5.1).  Unfortunately,  this  algorithm  has  a 
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serious  disadvantage,  as  the  following  example  will  illustrate. 
Consider  the  Cholesky  factorization  of  the  matrix 


r  =  II  (3.5.2) 

In  this  case,  <f>  is  the  large  negative  quantity  1  —  1/^  and  the  diagonal  ^2  is  set  to  6.  This 
gives  the  factorization 


Note  that,  although  we  have  made  the  smallest  allowable  modification  to  d2,  we  have 
produced  a  very  large  diagonal  correction  and  the  modified  matrix  is  in  no  way  “close” 
to  the  original.  This  has  occurred  because  the  lower-triangular  element  has  been  allowed 
to  become  large. 

This  problem  does  not  arise  when  using  the  Gill-Murray  modified  Cholesky  fac¬ 
torization  (Gill  and  Murray  {1974a])  since  the  diagonals  are  adjusted  so  as  to  give  lower- 
triangular  elements  that  are  bounded  in  magnitude  by  an  a  priori  bound.  However,  in 
order  to  compute  this  bound  it  is  necessary  to  determine  an  accurate  bound  on  ||G|| 
which  may  not  be  possible  or  convenient  if  G  is  large  and  not  stored  explicitly. 

An  alternative  to  the  Gill-Murray  factorization  which  is,  nevertheless,  similar  in 
principle  requires  the  computation  of  the  factorization 


pi  f32\  pi  M_p  oy^i  oyi 
\  P2  “2  y  VO  P2  J  \  h  1  /V  0  d2  /  V  0  1  y 


so  that  the  quantity  ci  -t-  p2  is  minimized.  (We  use  a  different  notation  for  the  elements 
of  the  diagonal  modification  and  the  diagonal  factor  because  p2  and  6,2  may  be  modified 
again  at  the  next  stage  of  the  factorization.)  Thus  if  we  define  the  quantity 


r  =  max  {<t>,S}, 


we  need  to  solve  the  problem 

min  ai  -I-  p2 

subject  to  T  =  02  +  p2  —  +  <^i) 

CTl  >0,  P2  >  0. 
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(3.5.3) 


If  ^  ffi  =  /?2  =  0.  Otherwise  t  =  6  and  we  compute, 

<^1  =  |/?2|  -  Si,  and  P2  ==  S  -  a2  +  \P2\, 

which  are  the  optimal  values  of  ai  and  p2  ignoring  the  non-negativity  constraints.  If 
either  correction  is  negative  we  use 

ffi  =  0  and  P2  =  6  —  a2  +  =  ^  ~  (3.5.4) 

or 

=i3|/(«2-^)-«i.  and  p^  =  0,  (3.5.5) 

choosing  the  sensible  pair  that  minimizes  ai  4-  p2.  Since  <^  <  (5,  at  least  (3.5.4)  must  be 
feasible. 

When  this  modified  Cholesky  factorization  is  used  on  the  matrix  (3.5.2),  the  factors 
are  given  by 

c :)(; :)(:  :)=(: 

which  is  suitably  “close”  to  the  original  matrix. 

The  first  column  of  the  Cholesky  factorization  is  unaffected  by  subsequent  iterations 
and  consequently  I2  is  the  required  sub-diagonal  element  of  L  with 

eii  =  Pi +(Ti. 

The  next  stage  of  the  reduction  involves  the  matrix 

^  «2  ^3  \ 

I  P3  0-3  / 

where  6:2  =  a2  +  P2  and  by  construction,  Q2  >  6.  This  matrix  is  identical  in  structure 
to  (3.5.1)  and  so  the  process  can  be  continued  to  find  the  modified  factorization  of  T2, 
Tz,  ...  etc. 

Two  advantages  of  the  Gill-Murray  modified  Cholesky  factorization  are  that  (a) 
it  is  possible  to  bound  the  diagonal  modification,  and  (b)  it  is  possible  to  compute  a 
direction  of  negative  curvature  when  =  0  and  is  indefinite  (see  section  2.2). 
The  implementation  of  the  algorithm  as  described  in  their  work  requires  accurate  a  priori 
bounds  on  the  elements  of  the  matrix  In  our  case,  the  tridiagonal  matrix  Tg  is  being 
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factored  as  it  is  generated,  and  the  matrix  may  not  be  available.  It  is  still  possible, 
though,  to  bound  the  norm  of  the  modification  and  to  compute  directions  of  negative 
curvature  when  the  gradient  is  zero. 

Theorem  (3.5.6)  Let  T  be  a  symmetric  tridiagonal  matrix,  with  principal  i  X  i 
submatrix  T,-.  Assume  that  a  modified  Cholesky  factorization 

Ti  +  Ei  =  LiDiLj 

is  computed  using  the  algorithm  described  above.  If 

7.  =  max{|ay|} 

j<* 

ft  =  nnax{|^y|}, 
j<t 

where  {oij}  and  {ftj}  are  the  diagonal  and  subdiagonal  elements  of  T,  respectively,  and 
if  <5  is  a  positive  tolerance  for  zero,  then 

ll^tlb  <  i[S  +  7*  +  a]- 

Proof  (by  induction)  Initially,  Ti  =  (cti).  If  ati  >  6,  then  Ei  —  (0);  otherwise, 
El  =  {6  —  aj).  In  either  case,  ||i?il|  <  6  +,lai  |  =  1  •  [(5  +  7i  +  fi].  (The  norm  used  is  the 
2-norm.) 

Assume  that  ||J5,||  <  -f-  7,-  -f-  f,].  If  a,-  =  p,+i  =  0,  then  no  modification  is 

necessary,  ||i?i+i||  =  ||^i||,  and  hence  ||£^,+i|l  <  (t  -f  l)[i5  +  7,+i  -f  f,+i). 

Otherwise,  t  —  6  and  we  compute  Oi  and  pij^i  from  (3.5.3),  (3.5.4),  or  (3.5.5).  When 
(3.5.3)  is  used, 

ll-^t+ill  <  ll-C'til  +  ^  +  7t-i-i  +  ft+ij 
and  the  result  follows.  If  (3.5.3)  is  infeasible,  we  use  either 

Oi  +  Pt+l  =  Pl+il^i  -  =  '0 

or 

+  Pi+i  =  =  0, 

where  aj+i  =  Oi+i  —  8.  There  are  three  cases: 

(i)  max{a,-,a,+i}  <  |A+ih  In  Ih's  case,  |/3,+i|  -  a,  and  jA+il  -  a»+i 

are  both  positive,  so  that  (3.5.3)  would  have  been  feasible. 
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.  (ii)  l/^t+il  ^  First,  V  ^  lA+i|  “  “i+i-  Also,  because  (3.5.4)  is 
feasible,  we  can  conclude  that  ^  >  0. 

(in)  |A+i|  <  ttt+i:  First,  0  <  lA+il  ~  “t-  Because  a,-  >  0  (by 
construction)  and  ai+i  >  0  (by  assumption),  then  9  =  (a^/ai+i)^  >  0. 

We  are  trying  to  minimize  a,-  +  pj+i  =  min{<?,  ^},  subject  to  the  feasibility  constraints. 
From  the  case-by-case  analysis  above,  we  can  conclude  that  min{fl,  <  max{|;9,.(.i  |  — 
ai,\Pi+i\  —  tti+i},  and  hence 

ll^t+ill  <  +  ^  +  7t-i-i  +  ft-i-l' 

This  implies  that  ||F^t^i||  <  (i  +  1)[5  -f  7,-4.i  +  ft+i],  and  the  result  is  proved. | 

In  the  above  discussion,  it  was  assumed  that  the  modifications  to  the  diagonal 
elements  of  the  2X2  matrices  were  both  positive.  When  factoring  a  tridiagonal  matrix, 
a  diagonal  element  might  be  modified  twice,  and  in  this  case  it  might  be  possible  to  allow 
one  of  these  modifications  to  be  negative.  This  would  not  affect  the  a  priori  bound  on  E 
derived  above.  For  this  reason,  we  have  not  examined  this  possibility  further. 

3.6.  Computing  a  Direction  of  Negative  Curvature 

The  procedure  for  the  computation  of  the  direction  of  search  will  break  down  if 
llfifll  is  zero.  If  G  is  positive  definite,  then  a  solution  has  been  determined.  It  remains, 
however,  to  confirm  that  G  is  indeed  positive  definite.  Moreover,  if  G  is  indefinite,  further 
progress  can  be  made  by  moving  along  a  direction  of  negative  curvature. 

Suppose  that  HgH  is  small.  We  wish  to  determine  if  G  is  indefinite,  and  if  so,  compute 
p  such  that  p^Gp  <  0.  To  do  this,  we  choose  vi  randomly,  ||ui||  =  1,  as  the  initial  vector 
for  the  Lanczos  process.  The  Lanczos  iteration 

VjGVj  =  Tj,  T,=L,DjLj 

is  performed  as  long  as  Tj  is  positive  semi-definite.  (Parlett  [1980]  reports  that  good 
approximations  to  the  extreme  eigenvalues  can  be  obtained  in  2n^  iterations;  if  G  is 
indefinite,  one  of  the  extreme  eigenvalues  will  be  negative.)  Note  that,  because  we  are 
trying  to  determine  if  G  is  indefinite,  the  tolerance  S  should  be  set  to  zero. 

Assume  that  Tg,  q  <  2n^,  is  the  last  positive  semi-definite  matrix  that  occurs.  Also, 
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suppose  for  the  moment  that  /3g+i  is  non-zero.  Then  =  Lq+iDq+iL^^j,  where 

where  6^+1.=  (0, . . .  ,0, ;8,+i).  Note  that  only  the  last  diagonal  element  of  Dq  can  be 
zero.  In  order  to  determine  a  direction  of  negative  curvature,  we  perform  an  orthogonal 
spectral  decomposition  of  a^+i  )' 


V^,+  l  ttg-t-iy  Vo 


Q'^Q  =  I, 


with  X5+1  >  0.  This  is  possible  because  r,+i  has  exactly  one  negative  eigenvalue.  Let 
p  =  u,+  i,  where 


- T 

=  («1  I  •  •  •  I  Uq+l)  = 


«9+i  ==  912^9  +  922‘W5+i-  This  is  nearly  the  same  as  the  formula  used  to  compute 
the  linear  conjugate-gradient  search  direction,  so  that  no  additional  vector  storage 
is  required  to  implement  this  idea.  Then,  if  =  (0,  ...,0, 1)  and  Dq+i  = 

diag^di, . . . , dq—i,  Xg,  Xg+i}, 


^  j  y 

^q+l^q  +  l^  q+fi^q+i^q+l^q+^ 

_ j  ...  «,  «. _ 2* 

~  ®g+l^q+ll-^9+l^g+l-^g-(-l]-^g-H®9+l 

=  — Xg+1  <  0. 

Thus  p  is  a  direction  of  negative  curvature.  It  should  be  noted  that,  since  the  Lanczos 
algorithm  seeks  out  the  most  negative  eigenvalue  of  G,  the  direction  p  computed  in  this 
way  ought  to  be  an  excellent  direction  of  negative  curvature,  i.e.  p^Gp/||p||  will  be  close 
to  its  minimum  value. 

Suppose  now  that  )3,+  i  =  0.  If  T,  is  positive  definite,  and  provided  our  initial 
random  vector  docs  not  lie  wholly  in  the  space  spanned  by  the  eigenvectors  corresponding 
to  positive  eigenvalues,  then  G  is  positive  definite  and  we  are  at  a  local  minimum  of  the 
objective  function  F.  However,  if  the  initial  vector  docs  lie  in  the  positive  eigenspacc,  no 
such  conclusion  can  be  made.  To  guarantee  the  indefiniteness  of  G,  we  require  a  more 
complex  procedure. 

We  will  carry  out  the  Lanczos  procedure  using  a  series  of  initial  vectors.  The  first. 
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v\,  will  be  chosen  randomly  as  above.  If  the  Lanczos  algorithm  terminates  at  iteration 
q,  with  q  <  n,  and  T,  is  positive  definite,  then  we  choose  another  initial  vector  v\ 
orthogonal  to  and  run  the  Lanczos  algorithm  again.  We  continue  in  this 

way  until  either  we  encounter  an  indefinite  T",,  or  until  {v},  Gu},  Gv\, . . .}  spans 

5R”.  In  the  latter  case,  we  can  assert  that  G  is  positive  definite. 

This  procedure  is  impractical  for  large  problems.  In  the  worst  case,  it  will  take  a  full 
n  steps  to  determine  whether  G  is  indefinite.  Suppose  that  G  has  one  negative  eigenvalue. 
Also,  let  {wj}  be  an  orthogonal  set  of  eigenvectors  for  G,  with  u"  corresponding  to  the 
negative  eigenvalue.  Then  the  Lanczos  algorithm  will  converge  in  one  iteration  for  each 
v\,  and  all  n  starting  vectors  will  have  to  be  used.  When  n  is  large,  this  is  unsuitably 
expensive. 

If  G  is  indefinite  then,  due  to  the  influence  of  rounding  errors,  it  is  highly  unlikely 
that  the  Lanczos  algorithm  will  terminate  without  discovering  the  indefiniteness  in  G. 
Even  if  the  initial  vector  vj  contains  no  component  in  the  negative  eigenspace,  any 
rounding  error  would  almost  certainly  introduce  one.  This  would  then  allow  a  negative 
eigenvalue  to  develop  in  Tq  as  desired.  Thus,  the  worst  case  behavior  described  above  is 
unlikely  to  occur  in  practice.  This  justifies  using  a  single  starting  vector  when  seeking  a 
direction  of  negative  curvature. 

If  Tq  is  only  positive  semi-definite,  i.e.  d,— 0,  then  it  is  not  possible  to  determine 
a  direction  of  negative  curvature  for  the  quadratic  approximation.  If  we  set  p  =  Uq, 
the  g-th  linear  conjugate-gradient  search  direction,  then  p^Gp  =  0  using  an  argument 
similar  to  that  above.  Such  a  p  may  be  a  direction  of  negative  curvature  for  F,  even 
though  it  is  not  for  the  quadratic  subproblem.  This  would  depend  on  the  higher-order 
derivatives  of  F. 

3.7.  Minimum  Residual  Methods 

As  was  seen  above,  the  preconditioned  Lanczos  method  generates  a  tridiagonal 
matrix  as  a  projection  of  the  coefficient  matrix  in  (3.3.1).  In  the  previous  sections,  we 
used  this  tridiagonal  matrix  to  minimize  a  quadratic  function  related  to  our  original 
system  of  equations.  This  tridiagonal  matrix  can  also  be  used  to  implicitly  solve  the 
normal  equations 

G^’Gp  =  G^6. 
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This  idea  is  the  basis  of  the  minimum  residual  algorithm  MINRES  of  Paige  and  Saunders 
|1975]. 

When  minimizing  the  quadratic  function,  we  formed  an  LZ)L^  factorization  of  the 
tridiagonal  matrix  T.  In  thils  context,  it  is  more  appropriate  to  factor  T,  (or,  to  be  more 
exact,  r,  +  JB,)  as 

=  Q^Q,  =  I,  (3.7.1) 

with  Z,  lower  triangular.  The  bar  is  used  to  indicate  that  L,  differs  from  the  q  X  q 
leading  part  of  in  the  {q,  q)  element  only.  The  details  of  this  factorization  can  be 
found  in  the  Paige  and  Saunders  paper. 

If  we  carry  out  the  orthogonal  factorization,  we  obtain  that 

=  L,Ll  (3.7.2) 

where  L,  is  the  leading  q  by  q  part  of  L,+i.  If  we  project  the  normal  equations  onto  the 
space  spanned  by  the  columns  of  the  matrix  V,,  we  obtain 

=  V'^Gb,  p,  =  V,z,. 

The  right-hand  side  of  this  system  can  be  written  as 

V^Gb  =  ^iV'^Gvi  =  piT^ei. 

Using  (3.7.1)  and  (3.7.2),  we  obtain  the  following  system  of  equations  for 

=  PiLqQ^ei.  (3.7.3) 

But  we  can  write 

Liq  =:  LqDq,  Dq  =  diag(l,  1, . . . ,  Cq'jf 
and  while  Lq  is  nonsingular,  (3.7.3)  gives 

LqZq  =  PlDqQqCl  =  {ti,  .  ,  ,  ,Tq)  =  tq,  (3  7  4) 

Tl=/?lCl,  T,  =  t  =  2,...,9, 

SO  there  is  minimal  error  in  computing  l^Uq.  Clearly  cannot  be  found  until  the 
algorithm  is  completed,  but  it  is  not  really  needed;  instead  we  form 

■  Nq  =  [ni,  .  .  .  ,  Tlq]  =  Vqliq 
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column  by  column,  and  then 

=  VgZg  =  VgL-^L^Zg  =  Ngtg, 

where  tg  is  developed  in  (3.7.2),  and  the  superscript  ^  shows  that  this  is  the  vector  which 
gives  the  minimum  residual.  It  can  be  seen  that  this  formula  does  not  require  the  storage 
of  the  matrix  N;  all  that  is  needed  is  its  final  column. 

Because  we  will  wish  to  terminate  this  algorithm  based  on  the  norm  of  the  residual, 
it  is  important  that  this  quantity  can  be  computed  with  little  effort  and  without  requiring 
excess  storage.  It  follows  from  the  results  of  Paige  and  Saunders  that 

Ikfllw  = 

which  satisfies  our  requirements,  and  which  shows  clearly  how  the  residual  norm  decreases 
each  step.  Here,  |1-||m  refers  to  the  M-norm  of  a  vector,  where  M  is  the  preconditioning 
matrix  for  the  Lanezos  algorithm  (see  section  3.8  below). 

3.8.  Preconditioning  the  Lanezos  Algorithm 

When  exact  arithmetic  is  used  throughout,  the  number  of  iterations  required  to  solve 
a  linear  system  Gp  =  b  using  the  conjugate-gradient  method  or  the  MINIIES  algorithm 
is  equal  to  the  number  of  distinct  eigenvalues  of  G  (see,  for  example,  Luenberger  [1973], 
pp.  176-178).  Therefore,  the  performance  should  be  significantly  improved  when  the 
original  system  is  replaced  by  an  equivalent  system  in  which  the  matrix  has  many  unit 
eigenvalues.  The  purpose  of  preconditioning  is  to  construct  a  transformation  to  have  this 
effect. 

Let  M  be  a  symmetric,  positive-definite  matrix.  The  solution  of  Gp  =  b  can  be 
found  by  solving  the  system 

M-iGM-^y=^  M-h, 

and  forming  p  =  M~^y.  Let  R  denote  the  matrix  M~iGM~^;  we  have  M~^RM^  = 
M~^G  and  therefore  R  is  similar  to  M~^G  and  has  the  same  eigenvalues.  The  idea  is 
to  choose  M  so  that  as  many  of  the  eigenvalues  of  M~^G  as  possible  are  close  to  unity. 
This  is  roughly  equivalent  to  choosing  M  so  that  the  condition  number  of  M~^G  is  as 
small  as  possible;  the  matrix  M  is  known  as  the  preconditioning  matrix. 
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Given  a  preconditioning  matrix  M,  we  can  apply  the  Lanczos  algorithm  to  the 
transformed  system  without  forming  R  and  without  the  need  to  find  the  square  root 
of  the  matrix  M.  In  practice,  M  will  often  not  be  explicitly  available.  It  will  only 
be  available  as  an  operator,  and  all  that  will  be  possible  is  to  solve  systems  of  linear 
equations  with  coefficient  matrix  M. 

The  recurrence  relations  analogous  to  (3.3.3)  for  the  transformed  system  are 

Pj+iVj+i  =  M-^Gvj  -  ajVi  -  pjVj-i,  OLj  =  wjGvy, 

where  vq  and  ui  are  chosen  as  before.  Notice,  however,  that  for  the  preconditioned 
algorithm  the  vectors  Vi  are  normalized  so  that  IKIIm  =  1-  After  the  g-th  step  we  have 

GV,  = 

Note  that  the  matrices  M,  T,,  and  Vq  satisfy  the  relations 

V'^MVq  =  Iq,  and  V^GVq  =  Tq. 

This  preconditioned  Lanczos  algorithm  allows  us  to  solve  the  system  of  equations  (3.3.1) 
in  the  same  way  as  in  sections  3.3  or  3.7;  The  crucial  fact  in  the  derivation  in  those 
sections  is  that  the  matrix  Vq  transforms  the  matrix  G  to  tridiagonal  form.  This  is  still 
true  for  the  preconditioned  algorithm. 

We  shall  discuss  the  choice  of  preconditioning  matrix  in  chapter  5.  We  shall  be 
particularly  interested  in  using  a  matrix  M  that  is  an  approximation  to  the  inverse  of  G. 
This  matrix  can  be  obtained  using  information  from  a  nonlinear  conjugate-gradient- type 
method  together  with  information  from  previous  linear  subiterations. 
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4  Terminating  the  Linear  Algorithm 


4.1.  Introduction 

In  order  to  fully  define  the  simplest  form  of  a  truncated-Newton  algorithm,  all  that 
remains  is  to  state  how  to  terminate  the  linear  algorithm.  The  fundamental  results  in 
this  area  were  proved  by  Dembo,  Eisenstadt  and  Steihaug  [1980].  They  provide  very 
useful  guidelines  for  developing  practical  convergence  criteria. 

In  section  4.2,  the  results  of  Dembo  et  al.  will  be  described  from  the  point  of 
view  of  function  minimization  (they  were  originally  stated  in  the  context  of  solving 
systems  of  non-linear  equations).  Because  these  results  are  stated  in  terms  of  the  2-norm 
of  the  residual  in  (2.2.3),  they  are  not  directly  applicable  or  even  natural  when  used 
in  conjunction  with  methods  based  on  minimizing  a  quadratic  function.  More  useful 
extensions  of  their  results  will  be  proved  in  section  4.3.  In  section  4.4,  practical  stopping 
criteria  for  the  linear  algorithm  will  be  discussed.  Finally,  in  section  4.5,  truncated- 
Newton  algorithms  based  on  a  trust-region  approach  will  be  derived. 

4.2.  Termination  Based  on 

Actually,  the  results  in  this  section  involve  the  relative  residual 

rW/|[pW||,  (4.2.1) 

where 

rW  =  +  gW, 

The  relative  residual  in  (4.2.1)  is  used  because  it  is  scale  free,  i.e.  multiplying  the  objective 
function  F  by  a  constant  does  not  affect  its  value.  Note  that  since  -*■  0  and 

||p(*l|l  —*■  0,  then  —  (— -♦  0.  Since  — would  be  an  exceedingly  poor 
approximation  to  p^*^  to  use  within  the  algorithm,  clearly  it  is  necessary  to  scale  in  the 
manner  described. 

All  the  results  of  this  section  will  be  of  the  following  form:  the  linear  iteration  will 
be  truncated  if  the  current  estimate  of  the  search  direction  guarantees  that 

||gW||  ~  fc  =  0, 1,...,  (4.2.2) 
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where  {<l>k}  is  some  “forcing  sequence”.  The  algorithm  is  then  completely  defined  given 
that  the  sequence  {^k}  has  been  specified. 

Before  proceeding  with  the  main  result,  we  require  the  following  definition. 

Definition  (4.2.3)  G(x)  is  Holder  continuous  at  x*  if  there  exist  constants  p  €  (0, 1] 
and  L  so  that  for  all  y  in  a  neighborhood  of  x* 

\\G{y)-G{x*)\\<L\\y-x*\\^. 

We  are  now  in  a  position  to  state 

Theorem  (4.2.4)  Assume  that  F  :  S?"  S?  is  a  real-valued  function  such  that 

1)  There  exists  a  local  minimum  x*  of  F. 

2)  F  is  twice  continuously  differentiable  in  a  neighborhood  of  x* 

3)  G{x*)  is  nonsingular  (and  hence  positive  definite) 

and  that  the  truncated-Newton  sequence  converges  to  x* .  Then 

i)  if  liTnk-too4>k  =  0,  the  convergence  rate  of  and  will  be 

superlinear. 

In  addition,  if  G{x)  is  Holder  continuous  at  x*  with  exponent  p  €  (0,1],  then  for  some 
c>  0, 

ii)  if  ^k  <  the  sequence  converges  with  Q-order  (1  -I-  p); 

and 

iii)  if  {<i>k}  converges  to  0  with  R-order  (1  -I-  p),  then  {x^^i}  converges  to 
X*  with  R-order  (1  +  p).| 

The  proofs  of  these  results  can  be  found  in  Dembo  et  al.  [1980]. 

4.3.  Alternative  Assessment  Criteria 

In  order  to  approximately  solve  the  system  of  linear  equations  (2.2.3),  some  variant  of 
the  conjugate-gradient  algorithm  is  used.  Although  Theorem  (4.2.4)  is  useful  in  indicating 
when  to  stop  the  conjugate-gradient  iteration,  it  is  based  on  ||r,||  a  quantity  which  does 
not  decrease  monotonically  as  the  algorithm  progresses  (the  subscript  q  refers  to  the 
linear  conjugate-gradient  iteration).  The  algorithm  is  based  upon  the  minimization  of 
the  quadratic  function 

Q{p)  =  iP^Cp  -1-  p’’ff. 


44 


It  would  be  preferable  to  stop  the  algorithm  based  on  the  value  of  this  quadratic  function, 
since  it  is  a  closer  measure  of  how  the  conjugate-gradient  algorithm  is  converging.  Ideally, 
we  would  like  to  measure  convergence  in  terms  of  the  quantity 

lb  -  pH 

Ibll  ’ 

where  p  is  the  minimizing  point  for  the  quadratic  function  Q{p).  Since  this  is  unavailable 
during  the  computation,  this  is  not  possible.  However,  a  simple  substitution  shows  that 

lb-p||g  ^  Q{p)  -  Q{p) 

Ibllc-i  Q{p) 

In  this  section  we  will  show  how  to  use  this  relation  to  derive  a  practical  convergence 
criterion.  First  of  all,  two  lemmas  arc  required. 

Lemma  (4.3.1)  If  G  is  symmetric  and  positive-definite  then 

y'^G^y  <  \\G\\y'^Gy. 

Proof  We  can  assume,  without  loss  of  generality,  that  G  =  diag{X,}.  Then, 

y^G^y  =  Y!>yi^i 

^  'y  ''j.Vi  ^t^max 
^max  y  ^  Vi  Xt 
=  \\G\\y^Gy.t 

Lemma  (4.3.2)  If  G  is  symmetric  and  positive-definite  then 

y^Gy  <  \\G~^\\y'^G^y.t 

The  proof  of  this  lemma  is  almost  identical  to  the  previous  proof  and  is  therefore  omitted. 
We  now  move  on  to  our  main  result. 

Theorem  (4.3.3)  Suppose  Q{p)  =  iP^Gp  +  p^g,  where  G  is  symmetric  and  positive- 
definite.  Let  p  denote  the  point  that  minimizes  Q.  Then,  for  any  p, 

'  IIGj.  +  9ll=  <  2||G||  ■  (eW  -  Q{p)) 

(e(p)-e(p))<  iiic->ii-i|Gp+jf. 

Proof  We  will  prove  here  the  first  result,  which  relies  only  on  Lemma  (4.3.1)  above. 
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The  proof  of  the  second  result  (which  relies  on  Lemma  (4.3.2))  is  almost  identical. 

\\Gp  +  ffll^  =  (Gp  +  p)’’(Gp  +  g) 

=  {Gp-Gpf{Gp-Gp) 

=  (p-  p)G®(p  -  p) 

<  l|G||(p’’Gp-2p’’Gp  +  p’’Gp) 

=  l|G||((p’’Gp  +  2p’’p)  -  p^g) 

=  2||G||(Q(p)-e(p)).| 

The  significance  of  this  result  is  that  we  can  now  rewrite  Theorem  (4.2.4)  with  ||r,|| 
replaced  by  (Q(p,)  —  Q(p)y^^-  One  major  advantage  of  using  this  quantity  is  that  it 
decreases  monotonically  during  the  linear  conjugate-gradient  iteration. 

Clearly,  since  p  is  unknown  during  an  iteration,  we  cannot  make  direct  use  of  this 
quantity  within  an  algorithm.  We  require  a  convergence  test  which  only  involves  com¬ 
putable  quantities,  and  at  the  same  time  maintains  a  superlinear  convergence  rate  in  the 
outer  algorithm.  To  this  end,  we  will  now  examine  the  behavior  of  the  linear  conjugate- 
gradient  algorithm  more  closely.  Because  the  performance  of  the  outer  algorithm  is  now 
of  interest,  the  superscript  is  now  included  in  the  formulas. 

Hestenes  [1980]  has  shown  (p.  44)  that 

g('')(p,+i)  -  Q('‘Hp)  <  iic[QW(p,)  -  Q(")(p)], 
where  K  =  (^^^)^  and  m.  At  are  the  extreme  eigenvalues  of  G^^K  Hence, 

9+l)  -  qW{p))<qW(p,)-qW(p,+,) 

From  this  we  conclude  that  it  is  possible  to  achieve  superlinear  convergence  by  insuring 
that 

[e'‘>(p,)-e'‘>(p,+i)]  =  o(ll9'*’ll)-  (<-3.4) 

The  significance  of  this  result  is  that  all  the  quantities  involved  can  be  computed  during 
the  course  of  the  linear  conjugate-gradient  iteration.  Thus,  this  is  a  practical  way  of 
being  able  to  guarantee  superlinear  convergence.  It  is  also  attractive  because  it  is  based 
on  the  successive  values  of  the  quadratic  function  minimized  by  the  linear  CG  algorithm, 
and  not  on  successive  values  of  the  residual.  Unfortunately,  like  the  norm  of  the  residual, 
it  does  not  decrease  monotonically  during  the  iteration. 
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When  using  the  linear  conjugate-gradient  algorithm  to  solve  Gp  =  —g,  we  clearly 
have  that  ||pol|  =  ||— ff||  and  that  ||Poo||  =  ^  •  llffll'  This  implies  that 

we  can  rescale  (4.3.4)  and  obtain  the  following  two  equivalent  convergence  tests: 


I0'‘l{p,)  -  <3<*)(p,+,)|S  =  <>(|e<‘>(p,)|*)  (4.3.5) 

ie'‘'(p,)-e'‘'(p,+.)l»  =  o(ie<‘>(p,+oi*)-  .  (4.3.6) 


These  measure  the  current  reduction  in  the  quadratic  function  against  its  initial  value 
and  its  latest  value,  respectively. 

There  is  one  further  extension  of  the  convergence  results  which  is  of  interest  when 
truncated-Newton  methods  are  used  in  conjunction  with  the  linear  algorithm  MINRES. 
This  concerns  the  use  of  norms  which  vary  from  iteration  to  iteration. 

When  MINRES  is  used  with  preconditioning  (see  Chapter  5  for  details)  the  progress 
of  the  linear  iteration  is  assessed  using 


where  ||•||A^■  is  defined  by 


llrWllMW 

IMIL  =  y'^^y 


(4.3.7) 


and  is  a  sequence  of  positive-definite  matrices.  From  the  way  in  which  the 

MINRES  algorithm  is  derived,  it  is  clear  that  (4.3.7)  has  the  desired  monotonicity 
property  which  the  original  convergence  criterion  and  (4.3.4)  lack. 

We  shall  assume  that  there  exists  a  uniform  finite  upper  bound  on  the  eigenvalues 
of  That  is,  there  exists  X^ax  such  that 


X(M(''))  <  X„,ax  <  00,  V*,  (4.3.8) 

where  X(A/^*'^)  is  any  eigenvalue  of  the  matrix  From  the  way  in  which  the  sequence 

is  constructed,  it  will  follow  that  there  is  a  uniform  positive  lower  bound  as  well. 
Thus,  there  exists  X^in  such  that 


0  <  X,„i„  <  X(MW).  (4.3.9) 

Using  (4.3.8)  and  (4.3.9)  it  is  clear  that 

Xiijipll  <  \\y\\^^,,  <  Xi,J|2/||,  VA 
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and  hence  that 


A„l„yi|rW||  llrWlU,., 

WJ  ||9“>ll  -  Il9<‘>ll„,«  -  ||9<‘>ll' 

From  this  it  follows  that  the  ratio  (4.3.7)  can  be  used  to  terminate  the  linear  iteration. 
4.4.  Practical  Forcing  Sequences 

The  forcing  sequence  in  the  convergence  test  can  be  chosen  in  accordance  with 
Theorem  (4.2.4).  This  theorem  shows  how  to  obtain  linear,  superlinear,  or  quadratic 
convergence  by  appropriately  defining  <f>k. 

If  we  select  0<^fc  =  c<lor0<c<(^jt<d<l,  then  the  rate  of  convergence 
will  be  linear.  If  <j)k  >  1,  then  =  0  is  a  valid  search  direction  and  no  step  is  made; 
hence  we  must  choose  <  1.  The  reason  for  choosing  <  1  is  to  insure  that  at  least 
one  linear  iteration  will  be  performed  and  that  a  direction  other  than  the  stecpest-descent 
direction  will  be  selected  if  possible.  When  a  preconditioning  strategy  is  employed  (see 
Chapter  5),  this  condition  could  be  relaxed. 

For  superlinear  convergence,  we  must  have  that  ->  0  as  Jk  — ►  oo.  Dembo  et  al. 
[1980]  have  suggested  using 

<f>k  =  l/fc.  (4.4.1) 

This  sequence  converges  to  zero  quite  slowly,  so  that  the  convergence  test  for  the  linear 
iteration  is  not  overly  stringent. 

Quadratic  convergence  can  be  attained  if 

<t>k  <  (4.4.2) 

Away  from  the  solution,  will  often  be  greater  than  one,  so  that  setting  <f)k  = 

may  not  lead  to  convergence.  This  suggests  combining  (4.4.1)  and  (4.4.2)  to  obtain 

<f>k  =  min  |l/fc,  (4.4.3) 

as  a  forcing  sequence.  Hereafter,  we  will  refer  to  (4.4.3)  as  the  “standard”  forcing 
sequence. 

It  may  be  desirable  or  necessary  to  limit  the  number  of  linear  iterations  allowed 
during  each  outer  iteration.  Because  the  linear  inner  algorithm  converges  at  a  linear 
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rate,  i.e. 


Q,+i-Q*  <k[q,-q% 

then  setting  an  upper  bound  of  L  linear  iterations  leads  to  a  linear  rate  of  convergence 
for  the  total  algorithm. 

Numerical  tests  of  various  forcing  sequences  are  conducted  in  Chapter  7. 


4.5.  Trust-Region  Methods 

The  quadratic  approximation  (2.2.2)  to  the  objective  function  is  clearly  not  valid  for 
all  values  of  p.  Up  until  this  point,  a  line  search  algorithm  has  been  used  to  monitor  the 
effectiveness  of  the  approximation  and  to  correct  for  its  deficiencies.  Another  approach 
to  this  problem  is  to  use  trust-region  methods. 

Trust-region  methods,  like  line-search  methods,  are  concerned  with  minimizing  the 
quadratic  approximation  (2.2.2).  But  unlike  line  search  methods,  a  constraint  is  added 
to  the  subproblem  which  involves  an  estimate  of  the  size  of  the  region  where  (2.2.2) 
adequately  predicts  the  decrease  in  value  of  the  objective  function  F.  In  exact  terms, 
then,  we  seek  to  solve 

minQ^*'^(p)  =  minF^*'^  -f  (4.5.1) 

p  p 

subject  to 

IIpII  <  tfW.  (4.5.2) 

In  order  to  obtain  a  solution  to  this  problem,  the  constraint  is  usually  rewritten  as 
iP^P  ^  This  transforms  (4.5.1)  and  (4.5.2)  into  a  convex  programming  problem. 

The  global  solution  is  obtained  by  finding  p  and  X  such  that 

j,(fc)  +  GWp+Xp  =  0 

iP^P  —  <0  ,  V 

X  >  0. 

The  problem  (4.5.3)  is  usually  solved  by  computing  some  estimate  of  the  Lagrange 
multiplier  X  and  solving  the  system  of  equations 

-1-  X/^^)  =  -gW.  (4.5.4) 
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A  check  is  then  made  to  insure  that  the  constraint  (4.5.2)  is  satisfied.  If  not,  a  new  X  is 
computed  and  (4.5.4)  is  employed  once  again. 

The  final  step  in  a  trust  region  iteration  involves  computing  and  This 

step  is  usually  based  on  the  value  of  the  ratio  between  the  actual  function  decrease 
and  the  predicted  decrease: 

P  ir(k)  _  Q(*)(a;(fc)  +  p(fc))  •  ^  ^ 

Generally,  the  larger  the  value  of  p^'‘\  the  more  adequately  the  constrained  subproblem 
(4.5.1),  (4.5.2)  indicates  a  decrease  in  the  objective  function.  Thus,  if  is  large,  a 
step  is  made  to  the  new  point  If  is  especially  large,  the  trust 

region  parameter  6^'^^  will  be  increased,  indicating  greater  confidence  in  the  quadratic 
approximation.  On  the  other  hand,  if  p^''^  is  small,  no  step  will  be  made  (i.e.  x(*'+‘)  = 
x^*'^)  and  will  be  decreased. 

The  above  is  intended  as  a  general  outline  of  trust  region  methods.  More  detailed 
information  as  well  as  exact  computational  formulas  can  be  found,  for  example,  in  Vardi 
[1980]  or  Hebden  [1973]. 

When  the  Newton  equations  (2.2.3)  are  being  solved  iteratively,  repeated  solution 
of  (4.5.4)  is  impractical.  Steihaug  [1980]  has  shown  that  use  of  (4.5.4)  can  be  avoided 
if  the  constraint  (4.5.2)  is  used  to  terminate  the  linear  conjugate-gradient  algorithm. 
Steihaug’s  work  was  done  in  the  context  of  truncated  quasi-Newton  methods  (similar 
to  truncated-Newton  methods,  except  that  an  approximate  solution  is  obtained  for  the 
quasi-Newton  equations  (2.3.3)  rather  than  the  Newton  equations  (2.2.3)),  but  his  ideas 
are  immediately  applicable  here. 

At  each  iteration  of  the  linear  conjugate-gradient  algorithm,  Steihaug  suggests 
monitoring  the  length  of  ||p,l|.  This  leads  to  the  following  formulas  and  tests: 

Pg  d” 

2.  If  l|Pfl+il|  <  then  continue  with  the  algorithm. 

3.  Otherwise  compute  t  >  0  such  that  [Jp,  -t-  t«,|1  =  set  = 

Pg  -h  TUg,  and  terminate. 

Steihaug  has  managed  to  show  that  the  resulting  algorithm  is  globally  convergent,  and 
is  able  to  prove  theorems  comparable  to  (4.2.4)  on  the  actual  rates  of  convergence  for 
various  forcing  sequences. 
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Little  computational  experience  has  been  reported  for  truncated-Newton  algorithms 
based  on  trust-region  strategies.  Although  they  show  obvious  promise,  they  lie  outside 
the  scope  of  this  thesis,  and  will  not  be  considered  further.  Much  of  the  work  used  in 
designing  linesearch-type  algorithms  is  directly  applicable  to  the  trust  region  approach. 
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5  Preconditioning 

5.1.  Introduction 

For  every  numerical  algorithm  there  is  an  ideal  problem.  For  Newton’s  method, 
the  ideal  problem  is  a  quadratic  function.  For  the  quasi-Newton  and  conjugate-gradient 
methods,  the  ideal  problem  is  a  quadratic  function  with  Hessian  matrix  equal  to  the 
identity.  More  generally,  we  often  think  of  the  class  of  problems  which  the  algorithm 
solves  well.  For  Newton’s  method,  this  is  the  set  of  functions  which  are  “nearly” 
quadratic.  For  quasi-Newton  and  conjugate-gradient  methods,  it  is  the  set  of  functions 
whose  Hessians  have  clustered  eigenvalues. 

Because  most  problems  are  not  ideal  for  an  algorithm,  it  is  important  to  have 
alternative  techniques  of  modifying  the  initial  problem  (without  altering  its  solution) 
so  that  it  is  easier  to  solve.  The  general  idea  is  to  make  the  given  problem  “closer”  to 
the  ideal  problem.  This  type  of  technique  is  called  preconditioning. 

Preconditioning  is  such  a  powerful  and  general  idea  that  there  exist  preconditioned 
versions  of  almost  every  known  numerical  algorithm,  both  direct  and  iterative.  Direct 
algorithms  often  use  preconditioning  to  reduce  the  error  in  the  computed  solution.  One 
common  example  of  this  is  the  use  of  column  scaling  in  Gaussian  elimination  (see,  for 
example,  Wilkinson  [1965],  chapter  IV).  Iterative  methods  generally  use  preconditioning 
to  speed  up  the  rate  of  convergence  (although  they  may  also  be  concerned  with  the 
condition  of  the  problem).  One  of  the  best-known  and  best-understood  examples  of  this  is 
the  generalized  (i.e.  preconditioned)  linear  conjugate-gradient  algorithm  (Concus,  Golub, 
and  O’Leary  [1976]).  A  brief  description  of  preconditioning  for  the  linear-conjugate 
gradient  algorithm  can  be  found  in  section  3.8. 

To  give  some  idea  of  the  versatility  of  this  concept,  it  is  possible  to  consider  Newton, 
quasi-Newton,  and  conjugate-gradient  algorithms  as  preconditioned  steepest-descent  al¬ 
gorithms,  with  the  preconditioning  being  generated  as  the  algorithm  proceeds  and  being 
modified  at  each  iteration. 

In  large  problems  where  it  is  expensive  to  compute  information,  it  is  important  to 
make  as  much  use  as  possible  of  every  computed  quantity.  This  generally  takes  the  form 
of  using  current  information  to  precondition  future  iterations. 

With  truncated-Newton  methods,  there  are  two  algorithms  to  be  concerned  with. 
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First  of  all,  there  is  the  outer  nonlinear  iteration.  In  the  basic  method,  this  is  just  the 
steepest-descent  method.  This  could  be  replaced  by  a  conjugate-gradient  or  a  limited- 
memory  quasi-Newton  method  (using  Newton’s  method  would  defeat  the  whole  purpose). 
This  idea  will  be  discussed  in  Section  5.2.  During  the  linear  algorithm,  matrix-vector 
products  involving  the  current  Hessian  matrix  are  computed.  It  would  be  desirable  to 
use  these  to  precondition  future  non-linear  and  linear  iterations.  This  is  the  subject  of 
Sections  3  through  6. 

We  believe  that  the  range  of  problems  for  which  a  truncated-Newton  method  will 
be  successful  will  be  extended  considerably  only  when  a  good  direction  can  be  produced 
in  a  small  number  of  linear  conjugate-gradient  iterations,  and  to  this  end  the  use  of 
preconditioning  is  essential. 

5.2.  Preconditioning  with  a  Non-Linear  Algorithm 

When  using  a  preconditioned  modified  Lanezos  algorithm  to  approximately  solve  the 
Newton  equations  (2.2.3),  at  each  iteration  it  is  necessary  to  solve  a  system  of  equations 

Mz  —  r 


involving  the  preconditioning  matrix  M.  Most  non-linear  optimization  algorithms  can 
be  viewed  as  computing  a  search  direction  by  solving,  possibly  implicitly,  a  system  of 
linear  equations 

Bp  =  -9, 

for  some  matrix  B.  Thus,  by  applying  the  formulas  for  the  non-linear  method  to  the 
vector  r,  it  is  possible  to  implicitly  define  a  matrix  M  which  can  then  be  used  as  a 
preconditioning  matrix  in  the  linear  algorithm. 

Setting  M  =  I,  i.e.  using  an  unpreconditioned  algorithm,  corresponds  to  the 
steepest-descent  method.  Another  possible  preconditioning  matrix  for  this  system  is 
an  r-step  limited-memory  quasi-Newton  matrix  H.  As  we  approach  the  solution,  and  F 
looks  more  and  more  like  a  quadratic  function,  a  small  number  of  quasi-Newton  steps  can 
often  produce  a  search  direction  which  is  much  superior  to  the  steepest-descent  direction 
or  to  a  traditional  non-linear  conjugate-gradient  direction  (see  Gill  and  Murray  [1979]). 

Using  a  quasi-Newton  preconditioning,  the  vector  will  be  the  first  non-trivial 

member  of  the  sequence  {p,}  and  this  direction  is  far  more  likely  to  give  a  good  reduction 
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in  the  function  than  Consequently,  even  if  the  linear  conjugate-gradient  algorithm 

were  terminated  immediately,  a  reasonable  search  direction  would  have  been  obtained. 


5.3.  Diagonal  Preconditioning  of  the  Nonlinear  Algorithm 

Nonlinear  minimization  algorithms  have  been  found  to  work  more  efficiently  if  the 
variables  are  properly  scaled.  This  means  that  all  of  the  variables  are  correctly  weighted, 
i.e.  that  a  unit  step  along  the  search  direction  will  approximate  the  minimum  of  the 
function  in  that  direction.  It  also  implies  that  the  tolerances  for  the  algorithm  have  the 
correct  scaling  (a  factor  even  for  the  more  scale-invariant  algorithms  such  as  Newton’s 
method).  One  way  of  achieving  this  is  through  a  diagonal  preconditioning. 

If  the  direction  of  search  is  obtained  from  the  quasi-Newton  equation  (2.3.2)  (which 
is  the  case  when  a  limited-memory  quasi-Newton  algorithm  is  used  as  a  preconditioning 
strategy),  the  BFGS  formula  (2.3.9)  may  be  simplified  so  that  the  matrix  does  not 
appear  in  the  rank-two  correction: 


^(fc-l-l)  ^  Q(k)  ^  - 1 - g(>‘)g(k)'^+  - i 

gW'^p(k)  aWyW'^pW 


This  result  implies  that  even  if  the  off-diagonal  elements  of  5^*'^  are  unknown,  the 
exact  diagonal  elements  can  still  be  recurred.  These  diagonal  elements  may  be  used  to 
precondition  the  conjugate-gradient  method.  Let  7y  and  V’y  denote  the  elements  of 
and  2/W  respectively.  If  Ajt+j  =  diag(5i, . . .  ,5„)  and  Afc  =  diag(<5i , . . . , ^„)  denote 
the  approximate  diagonal  Hessians  during  the  {k  +  1)^^  and  iterations  respectively, 
then 


Sj  —  Sj  + 


gW^pW 


This  diagonal  preconditioning  step  involves  an  approximation  to  the  diagonal  pf  G 
based  on  yW,  p(*),  and  In  the  linear  iteration  of  a  truncated-Newton  method, 

though,  matrix/vector  products  involving  G  are  computed.  It  would  be  desirable  to  use 
this  more  exact  second-derivative  information  to  compute  A,  the  diagonal  precondition¬ 
ing  matrix. 


Several  methods  of  computing  A  have  been  developed  and  tested.  The  first  two  are 
rank-one  and  rank-two  Quasi-Newton  updates  which  are  based  on  the  (false)  assumption 
that  G  is  a  diagonal  matrix.  A  third  is  a  BFGS  update  to  the  diagonal  of  the  approximate 
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Hessian.  In  addition,  it  is  possible  to  use  exact  information  about  the  diagonal  of  the  Hes¬ 
sian  either  to  precondition  the  linear  algorithm  or  to  initialize  the  linear  preconditioning. 
Note,  however,  that  even  if  matrix-vector  products  of  the  form  Gv  can  be  found,  it  may 
be  inconvenient  to  compute  Ga. 

At  each  linear  iteration,  a  computation  of  the  form 

y  —  Gs 


is  performed.  If  the  symmetric  rank-one  Quasi-Newton  update  is  rewritten  with  B 
replaced  by  the  diagonal  matrix  A,  we  obtain 


Ai+i  —  A,  -f 


(y  -  A,s)(y  -  A.s)^ 

[y  -  Ais)'^3 


Any  off-diagonal  terms  in  the  rank-one  term  are  ignored.  Notice  that  no  matrix  need  be 
stored  in  order  to  implement  this  update. 

A  similar  adaptation  can  be  performed  for  the  BFGS  formula.  This  time  the  result 

is 

A,>i  =  Av  —  ^j^^(A,s)(Ajs)  -f  yy  • 

Again,  off-diagonal  terms  in  the  rank-one  terms  are  ignored,  and  no  matrix  storage  is 
required. 

There  is  a  further  way  in  which  a  diagonal  quasi-Newton  update  can  be  used  to 
approximate  to  the  diagonal  of  G.  Because  the  linear  conjugate-gradient  algorithm  is 
equivalent  to  the  BFGS  algorithm  (when  applied  to  the  same  quadratic  with  Bq  =  7),  it 
is  possible  to  show  that  =  G.  Thus,  if  we  were  able  to  update  only  the  diagonals  of 
B,  at  the  end  of  n  steps  we  would  have  the  exact  values  for  the  diagonal  elements  of  G. 
Unlike  the  two  diagonal  updates  above,  this  will  be  an  exact  rather  than  an  approximate 
quasi-Newton  update. 

To  develop  this  update,  we  will  ignore  the  nonlinear  algorithm  for  the  moment,  and 
concentrate  our  attention  on  one  instance  of  the  linear  conjugate-gradient  method.  We 
are  attempting  to  minimize  the  quadratic  function 


<A(p)  =  ip'^Gp  +  p\ 


and  hence 

g{p)  ^  ^'[p)  =  Gp  +  c  =  -r(p). 


55 


where  r[p)  is  the  residual  at  p.  The  linear  conjugate-gradient  algorithm  is  initialized  with 
Po  =  0,  and  at  the  iteration  the  next  estimate  of  the  solution  is  computed  as 

P9+I  =  P,  +  OlgUg, 

where  Ug  is  the  search  direction  and  a,  is  the  step-length. 

The  BFGS  algorithm  computes  the  (same)  search  direction  using  the  formula 


BgUg  -  Qg, 


(5.3.1) 


where  g,  =  9[Pq)-  If  an  exact  line-search  is  used,  the  step-length  for  the  BFGS  algorithm 
is  the  same  as  that  for  the  linear  conjugate-gradient  algorithm.  Under  the  assumptions 
that  Po  =  0>  ^0  =  I)  and  that  the  new  approximate  Hessian  is  computed  using 


BgJfl  Bq  {BgSq)[BgSq)  -f  PgP,, 

PgSg 


(5.3.2) 


both  algorithms  compute  the  same  estimates  of  the  solution  at  every  stage. 

It  is  possible  to  adapt  (5.3.2)  so  that  only  the  diagonals  of  the  update  need  be 
computed.  Using  (5.3.1)  and 


we  can  conclude  that 


The  other  important  fact  is 


Sg  —  Pg+l  Pg  —  ®g'ag) 


BgSg  -  OCgQg. 


(5.3.3) 


Vq  =  ffg+1  -  ffg  =  aqGUg. 


(5.3.4) 


If  we  incorporate  (5.3.3)  and  (5.3.4)  in  (5.3.2),  we  obtain 


^9+1  ^9 


(5.3.5) 


Using  (5.3.5),  any  individual  element  of  Bg  can  be  individually  updated. 

■  When  the  linear  conjugate-gradient  algorithm  is  used  directly,  (5.3.5)  is  quite  ade¬ 
quate.  Unfortunately,  problems  arise  when  a  linearly  preconditioned  modified-Lanezos 
algorithm  is  used  instead.  First  there  is  the  problem  of  linear  preconditioning.  The 
correspondence  between  the  BFGS  and  the  linear  conjugate-gradient  methods  assumes 
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that  no  linear  preconditioning  is  used.  This  is  a  very  easy  problem  to  surmount,  since 
the  linear  conjugate-gradient  algorithm  preconditioned  by  the  matrix  M  is  equivalent  to 
the  BFGS  algorithm  initialized  with  Bq  =  M.  To  see  this,  replace  G  by  in 

the  above  derivation. 

The  other  problem  concerns  scaling.  When  the  linear  conjugate-gradient  algorithm 
is  implicitly  implemented  using  the  Lanezos  algorithm,  the  vectors  corresponding  to  the 
search  direction  and  the  residual  are  not  properly  scaled.  This  scaling  does  not  affect 
the  final  term  in  (5.3.5),  since  the  scaling  enters  equally  into  the  numerator  and  the 
denominator.  The  other  rank-one  matrix  is  affected,  however.  In  our  implementation 
of  the  algorithm,  the  correctly  scaled  residual  is  available.  This  leaves  only  the  inner 
product  ujrg.  Using  the  recurrence  relation  for  the  search  direction  Ug,  and  the  fact  that 
the  residuals  are  Af-orthogonal,  it  can  be  shown  that 

T  T 

‘^q^q  —  ^q'^qi 

where 

=  r,. 

Since  our  algorithm  computes  equally-scaled  multiples  of  z,  and  r,  as  well  as  the  correctly 
scaled  r,,  it  is  possible  to  correctly  compute  the  inner  product. 

Because  the  Hessian  matrix  is  not  always  positive-definite,  the  modified-Lanezos  al¬ 
gorithm  alters  the  subproblem  it  is  solving  when  it  runs  across  evidence  of  indefiniteness. 
The  preconditioning  scheme  is  trying  to  approximate  the  diagonals  of  the  actual  Hessian 
matrix,  and  two  of  the  preconditioning  algorithms  described  above  have  the  property  of 
hereditary  positive-definiteness,  so  there  is  some  question  as  to  what  should  be  done  when 
the  Hessian  matrix  is  modified.  We  have  chosen  to  omit  the  diagonal  update  whenever 
the  matrix  goes  indefinite.  Since  our  implementation  of  the  modified-Lanezos  algorithm 
only  performs  one  iteration  with  the  modified  matrix  before  returning  to  the  nonlinear 
algorithm,  very  little  second-derivative  information  is  wasted  using  this  approach. 

There  is  some  theoretical  evidence  to  indicate  that,  among  diagonal  preconditionings, 
this  final  preconditioning  strategy  is  the  most  effective.  Forsythe  and  Straus  [1955]  have 
shown  that  if  the  Hessian  matrix  G  has  property  A,  then  the  diagonal  of  G  is  the  optimal 
diagonal  preconditioning.  This  assumption  is  valid  for  many  problems  arising  in  partial 
differential  equations.  Also,  in  the  general  case,  van  der  Sluis  [1969]  has  proven  that 
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preconditioning  with  the  diagonal  of  G  will  be  nearly  optimal,  in  the  sense  that  the 
condition  number  of  G  preconditioned  by  its  diagonal  will  be  at  most  n  times  as  large 
as  the  condition  number  of  the  optimally  diagonally  preconditioned  G.  Thus,  estimating 
the  diagonal  of  G  using  the  BFGS  formula  (1.5)  should  be  effective  for  all  problems. 

Using  (5.3.5)  it  is  possible  to  compute  any  number  of  subdiagonals  in  addition  to  the 
main  diagonal.  Because  this  extension  is  so  straightforward,  the  details  will  be  omitted 
here. 


5.4.  Diagonal  Preconditioning  with  MINRES 


In  sections  5.2  and  5.3,  several  methods  for  diagonally  preconditioning  a  truncated- 
Newton  algorithm  were  described.  The  first  three  (the  non-linear  preconditioning  and 
the  rank-one  and  rank-two  diagonal  updates)  can  immediately  be  applied  to  MINRES 
since  they  do  not  rely  on  any  special  properties  of  the  underlying  linear  algorithm. 
However,  a  fourth  preconditioning  (a  BFGS  update  to  the  diagonal  of  the  approximate 
Hessian)  is  dependent  on  the  correspondence  between  the  BFGS  quasi-Newton  algorithm 
with  exact  line  searches  and  the  linear  conjugate-gradient  algorithm.  In  order  to  adapt 
this  preconditioning  strategy  to  MINRES,  we  must  analyze  the  correspondences  between 
MINRES  and  the  linear  conjugate-gradient  method. 

The  search  directions  in  MINRES  are  different  to  those  generated  in  the  linear 
conjugate-gradient  method.  Consequently,  at  first  sight  we  cannot  implement  the  fourth 
preconditioning  technique  which  relied  on  the  relationship  between  the  search  directions 
for  the  BFGS  algorithm  applied  to  a  quadratic  function  and  those  for  the  linear  conjugate- 
gradient  algorithm.  What  we  shall  show,  however,  is  that  from  information  available  in 
the  MINRES  algorithm,  we  can  easily  generate  both  the  required  search  directions  and 
the  required  vectors  to  update  the  Hessian  approximation. 

To  this  end,  we  define 

Wg  =  [u;i,...,«;,_i,W,]  =  VgQ'^, 

and 

Wg  = 


If  the  Lanezos  process  stops  with  Pm+i  —  0,  it  is  then  easily  verified  that 


GNrr,  =  =  ymQ'L= 


58 


(5.4.1) 


It  is  now  straightforward  to  establish  the  desired  correspondences.  Using  results  from 
Paige  and  Saunders  [1975],  we  obtain 


-  Pf  ==  T,(a,/c,)’*n,  =  7,n, 


(5.4.2) 


and 


=  -  V,4.l))/C,  -  ')gWg 

=  {-Pi9lS2-SqVg+l)ICg  = 


Recall  that  we  are  trying  to  compute 


B. 


9+1 


=  Bg- 


ufrg 


r,r[+ 


(Gu,){Gu,)’; 


(5.4.3) 


where  the  vectors  Ug  and  r,  refer  to  the  search- direction  and  the  residual  from  the 
linear  conjugate-gradient  algorithm.  Formula  (5.4.3)  indicates  how  to  compute  r,  for 
this  update.  Since 


P9+I  —  P9  d"  ^q'^q 


and 

pf+l  =  pf  +  ^q^q, 

we  can  subtract  these  two  equations  from  each  other  and  use  (5.4.2)  to  obtain 


“9^9  =  (7g+i  +  T,+i)n,+i  -  7,n,.  (5.4.4) 

Multiplying  (5.4.4)  by  G  and  using  (5.4.1)  we  obtain 

agGug  —  (7,4.1  +  T,4.i)Ti;,+i  -  7,10,.  (5.4.5) 

Consequently,  the  vector  Gug  need  not  be  calculated  directly.  This  is  of  particular 
significance  in  the  non-linear  algorithm  when  G  may  be  unknown.  Thus,  we  are  able  to 
compute  a  scaled  version  of  the  conjugate-gradient  search  direction.  Since  the  final  term 
in  the  BFGS  update  is  scale-invariant,  we  can  use  (5.4.4)  and  (5.4.5)  in  order  to  compute 
it.  This  is  not  true  of  the  first  term,  but  a,  can  be  computed  using 

’  io‘q'>^qV{o‘qGug)  (a?M,)’’(a,Gti,)  (a,tt,)r(a,Gu,)’ 

where  M  is  the  preconditioning  matrix  for  the  Lanezos  algorithm.  Combining  all  of  these 
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results,  we  obtain  the  desired  formula  for  the  BFGS  update: 

1  t  ^  \t  ^  \T 


*9+1 


Ba 


(a,u,)rr,  «+‘'^  (<,,u,no,Ga,) 


{^OCqGUq^^J^qGViq^  • 


5.5.  Tridiagonal  Preconditiong 

The  linear  conjugate-gradient  algorithm  transforms  the  Hessian  matrix  G  according 
to  the  formula 

GRq  =  VqTq  -|- 


Thus 


R'^GRq  =  Tq. 


This  suggests  the  use  of  the  preconditioning  matrix 

VqTqVj=Mq. 


The  matrix  Mq  is  rank  deficient  and  consequently  cannot  be  used  directly  as  a  precon¬ 
ditioning  matrix.  By  extending  the  definitions  of  Vq  and  Tq,  we  can  construct  a  precon¬ 
ditioning  matrix  utilizing  the  information  in  Mq. 

In  order  to  extend  Vq,  we  form  its  QR  factorization  (using,  for  example.  Householder 
transformation  (see  Wilkinson  [1965],  pp.  290-299)): 


V,  =  QR, 

where  Q  is  an  n  X  n  orthogonal  matrix;  R  is  of  the  form 


and  is  a  ik  X  Jfc  up>per-triangular  matrix.  If  we  partition  Q  conformally  to  R: 

Q  =  {Qi\  Qz), 


the  columns  of  Qi  span  the  same  space  the  columns  of  V,,  and  the  columns  of  span  its 
orthogonal  complement.  To  complete  the  extension  of  V,  to  the  whole  space,  we  define 


To  extend  T,,  we  exploit  the  convergence  theory  for  the  Lanczos  algorithm.  Parlett 
[1980]  has  shown  that  the  eigenvalues  of  T,  tend  toward  the  extreme  eigenvalues  of  G. 
It  is  natural,  then,  to  define 


where 

=  ^min(yg)  ^  7 
Some  possible  choices  for  7  are 

7  =  iiei  +  e„) 

and 

7  =  (ei  -Cn)*. 

The  full  preconditioning  is  then 

M,  =  QRTgR^Q'^. 

A  similar  tridiagonal  preconditioning  can  be  produced  by  using  the  approximation 

G^R~j2\R~\ 

where  R^  refers  to  the  extension  of  i?,  to  the  whole  space  (as  in  the  definition  of  P,). 

5.6.  A.pproximating  the  Product  of  the  Tridiagonal  Preconditionings 

Although  at  the  first  iteration  the  tridiagonal  matrix  from  the  Lanczos  algorithm 
has  eigenvalues  which  approximate  the  extreme  eigenvalues  of  G,  at  subsequent  iterations 
the  Lanczos  algorithm  is  being  applied  to  a  preconditioned  version  of  G  whose  extreme 
eigenvalues  may  bear  little  relation  to  those  of  G  itself— in  fact,  this  is  the  intent  of  the 
preconditioning  strategy.  One  attempt  to  surmount  this  problem  involves  computing 
the  product  of  the  previous  preconditioning  matrices  Mi,  and  using  this  product  as  the 
preconditioning  at  the  next  iteration.  Because  of  storage  limitations,  this  product  cannot 
be  computed  exactly,  and  an  approximation  to  it  must  be  used.  That  approximation  is 
the  topic  of  this  section. 

Suppose  we  have  already  computed 

fc-i 

^k-i  JJ  M,- 

»=i 
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in  the  factored  form 


Mfc_i  =  (Vfc_i 


yL,r- 


Also  assume  that  the  current  preconditioning  matrix  Mk  is  available  in  the  form 

Mu  =  (Vi  I  I  vi-r. 

~  i 

Here,  Lk  is  lower-bidiagonal  and  Lk-i  is  lower  triangular.  Then,  by  applying  on  the 
left  and  right  to  Mk-i,  we  obtain  an  approximation  to  Mk’ 


-  (V.  I  v^)(^‘ I  yi-f- 

If  we  treat  the  central  factors  in  this  product  as  block  2X2  matrices,  compute  their 
product,  and  ignore  off-diagonal  terms,  we  obtain 


Mk 


=  {Vk 


■ ) 


0  0 
0 


J 


(Vjc 


Vil-) 


(Ikh  0 

V  0 


vyr 


where  Lk  is  lower  triangular  and  7^  =  ')klk-i-  is  obtained  by  doing  a  Cholesky 
factorization  of  the  first  matrix  in  the  sum  above.  This  is  the  desired  approximation  to 
the  product  of  the  preconditioning  matrices. 
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6  Extensions  to  Other  Problems 


6.1.  Introduction 

So  far,  we  have  been  concerned  with  the  solution  of  unconstrained  minimization 

« 

problems  where  few  assumptions  have  been  made  about  the  form  of  the  objective  func¬ 
tion  F.  It  is  very  common  to  encounter  problems  where  auxiliary  conditions,  called 
constraints,  are  placed  on  the  independent  variables  x.  Another  common  problem  is  the 
minimization  of  functions  which  can  be  represented  as  the  sum  of  the  squares  of  other 
functions.  Such  problems  are  often  referred  to  as  least-squares  problems. 

In  this  chapter,  we  show  how  truncated-Newton  methods  can  be  adapted  to  solve 
problems  of  both  these  types.  In  sections  6. 2-6. 4  we  discuss  constrained  problems,  and 
in  section  6.5  a  treatment  of  least-squares  problems  is  given.  Since  successful  methods 
already  exist  for  dense  problems  of  these  types,  we  are  especially  concerned  with  large 
sparse  problems.  In  particular,  currently  it  is  difficult  to  solve  constrained  problems 
where  the  number  of  variables  n,  the  number  of  constraints  t,  and  their  difference  n  —  t 
are  all  large.  Truncated- Newton  methods  provide  some  hope  in  this  case. 

6.2.  Constrained  Minimization  Problems 

The  most  general  constrained  optimization  problem  can  be  expressed  in  the  form 

mini?’(i)  (6.2.1) 

X 

subject  to  the  conditions 

c,(x)  >0,  i  =  1, . . . ,  m.  (6.2.2) 

Here,  F[x]  and  Cj(i)  are  functions  mapping  from  S?”  —*■  5R. 

Problems  of  this  type  are  often  further  classified  by  the  form  of  the  constraints 

(6.2.2) .  The  major  division  is  made  between  linear  and  non-linear  constraints.  The 
constraints  are  also  further  divided  into  groups  of  equality  and  inequality  constraints. 

The  form  of  the  constraints  can  strongly  affect  the  way  in  which  the  problem  (6.2.1), 

(6.2.2)  is  solved.  Methods  for  problems  with  linear  equality  constraints  will  be  discussed 
in  section  6.3,  and  linear  inequality  constraints  in  section  6.4.  Methods  for  non-linearly 
constrained  problems  are  still  an  active  research  area.  It  is  not  yet  clear  how  best  to 
apply  Newton’s  method  to  such  problems  when  the  number  of  variables  is  large.  For 


63 


this  reason,  and  because  the  technical  details  of  such  methods  can  strongly  affect  the 
development  of  the  relevant  theory,  we  will  only  consider  the  application  of  truncated- 
Newton  methods  to  linearly  constrained  problems.  Much  of  what  we  describe  will  be 
relevant  to  the  non-linear  case. 


6.3.  Problems  with  Linear  Equality  Constraints 

The  problem  we  are  concerned  with  in  this  section  (denoted  hereafter  as  ECP)  is 
usually  written  in  the  form 

mini^{x)  (6.3.1) 


subject  to 


where  Ais  &t  X  n  matrix. 


Ax  —  b, 


(6.3.2) 


The  set  of  constraints  (6.3.2)  restricts  the  set  of  “feasible”  points,  i.e.  the  set  of 
points  which  will  be  considered  when  solving  (6.3.1).  As  usual,we  denote  the  solution  to 
ECP  by  X* .  If  X  is  any  other  feasible  point,  then 

A(Ai)  =  A{x*  —  i) 

=  Ax*  —  Ax 

=  6-6  =  0. 


Also,  if  p  is  any  vector  satisfying 


Ap  =  0, 


A{x*  -f  p)  =  Ax*  +  Ap 
=  6-1-0 
=  6, 

so  X*  -f  p  is  also  feasible.  Thus,  all  feasible  steps  from  the  point  x*  are  orthogonal  to  the 
rows  of  A  (or,  equivalently,  to  the  columns  of  A^). 

Let  Z  denote  a  matrix  whose  columns  form  a  basis  for  the  null  space  of  A^,  i.e. 
AZ  =  0.  Then  any  feasible  point  must  be  of  the  form  x*  +  Zpz  for  some  p^.  If  we 
examine  the  Taylor  series  for  F  around  the  point  x* ,  we  find  that 

F[x*  +  ep)  —  F{x*  +  cZpz)  =  F{x*)  +  €p'^Z'^g(x*)  -1-  ^p'^Z'^G{x*)Zpz  +••■.  (6.3.3) 

Clearly,  if  x*  is  the  minimizing  point  for  the  constrained  problem,  then 

Z'^g{x*)  —  0. 
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(6.3.4) 


The  quantity  Z'^g  will  be  referred  to  as  the  projected  gradient. 

The  condition  (6.3.4)  implies  that  g{x*)  lies  in  the  range  of  A’^: 

0  =  Z'^g(x*) 

=  Z’IaV  +  Zg,) 

—  Z'^Zgg. 

Hence,  gz  =  0.  The  coefficients  of  the  vector  are  denoted  and  are  called 

the  Lagrange  multipUera  at  x*. 

Using  (6.3.4)  and  (6.3.3)  we  find  that 

F{x*  +p)  =  F{x*)  +  ^—plZ'^G{x*)Zpz  +  •  •  • , 

so  that  the  matrix  Z^GZ  (the  projected  Hessian)  must  be  positive  semi-definite  at  x*. 

Thus  we  have  obtained  first-  and  second-order  necessary  conditions  for  a  solution 
to  problem  ECP.  Sufficient  conditions,  derived  in  the  same  way,  are  (if  x*  is  a  feasible 
point): 

(i)  Z'^g(x*}  =  0  (<7(i*)  =  AX*). 

(ii)  Z'^G(x*)Z  is  positive  definite. 

We  now  move  on  to  derive  methods  for  solving  ECP.  To  do  this,  we  return  to  the 
Taylor  expansion  (6.3.3).  The  function  F  is  expanded  about  an  arbitrary  feasible  point 
X  in  the  direction  Zpz  with  e  =  1,  and  the  series  is  truncated  after  the  quadratic  term. 
We  obtain 

F(x  +  Zpz)  =  F(x)  +  p^Z^g  + 

Setting  F(x  -I-  Zpz)  =  0,  we  obtain  that  the  step  to  the  minimum  of  this  quadratic  is 
found  by  solving  the  projected-Newton  equations: 

GzP  =  -gz,  (6.3.5) 

where 

Gz  =  Z'^GZ, 
gz  =  Z'^^g. 

Because  of  the  similarity  between  (6.3.5)  and  the  Newton  equations  (2.2.3),  it  is  easy 
to  derive  a  modified-Ncwton  algorithm  for  solving  ECP  by  specifying  that  G^^^  and 
be  replaced  by  G^z^  and  g^z^  in  all  the  relevant  formulas. 
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Truncated-Newton  methods  can  be  extended  just  as  easily.  All  that  is  required  is  a 
subroutine  to  compute  products  of  the  form 

y  = 

« 

for  any  v.  The  fact  that  a  projected  Hessian  is  being  used  is  irrelevant  to  the  algorithm. 

In  the  above  discussion,  we  have  not  stated  how  to  obtain  Z,  the  basis  for  the  null- 
space  of  A'^.  In  the  dense  case,  it  is  possible  to  derive  Z  from  a  LQ  factorization  of  A. 
(We  will  assume  that  A  has  full  row  rank.)  There  exists  an  orthogonal  matrix  Q  which, 
when  applied  to  A  on  the  right,  yields  an  lower-triangular  matrix: 

AQ  =  L  =  {L  0),  i.e.  A  =  Q'^L, 

where  Lis  a.  t  X  t  lower-triangular  matrix.  We  partition  Q  conformally  to  L: 

Q  =  {YZ), 

so  that 

AQ  =  {AYAZ)^{L  0). 

Thus,  the  last  (n  —  t)  columns  of  Q  are  orthogonal  to  the  rows  of  A,  and  Z  is  the 
desired  basis  for  the  null  space  of  A^.  The  LQ  factorization  above  can  be  computed 
using  elementary  orthogonal  transformations  (such  as  Householder  reflections  or  Givens 
rotations). 

In  the  sparse  case,  it  is  usually  preferable  to  use  a  technique  known  as  variable 
reduction  to  form  Z.  We  partition  the  constraint  matrix  A  in  the  form 

A  =  {TU), 

where  T  is  a  t  X  t  non-singular  matrix.  For  simplicity,  we  have  assumed  that  T  cor¬ 
responds  to  the  first  t  columns  of  A.  With  this  partitioning  of  A,  a  matrix  Z  orthogonal 
to  the  rows  of  A  can  be  defined  as 


If  Z  is  defined  in  this  way,  it  need  not  be  explicitly  formed.  Wc  need  only  bo  able  to  solve 
systems  of  equations  involving  T  and  so  that  all  that  is  required  is  a  factorization 
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of  T.  Note  that  the  matrix  T~^  is  not  obtained  explicitly.  To  compute  the  product  Z 
times  a  vector,  we  perform  back-substitution  using  the  factors  of  T.  This  allows  us  to 
exploit  sparsity  in  the  constraint  matrix,  although  in  general  the  condition  number  of  the 
matrix  Z  obtained  from  variable  reduction  will  be  larger  than  for  the  matrix  Z  obtained 
from  an  LQ  factorization  (in  that  case,  we  will  have  k.{Z)  =  1). 

6.4.  Linear  Inequality  Constraints 


6.4.1.  Theory 


The  problem  we  are  considering  in  this  section  (denoted  by  ICP)  will  be  posed  in 
the  form 


min  F{x) 


(6.4.1) 


subject  to 


Ax  =  6, 

I  <x  <  u, 


(6.4.2) 


where  A  is  an  m  X  a  matrix.  Notice  that  the  only  inequality  constraints  are  just  simple 
bounds  on  the  variables.  General  inequality  constraints  are  converted  to  this  form  by 
the  introduction  of  slack  variables.  This  problem  is  considerably  more  complex  than  the 
equality-constraint  problem  ECP  since  we  do  not  know  in  advance  which  bounds  (if  any) 
will  be  exactly  satisfied  as  equalities  at  the  solution. 

A 

If  A  were  the  set  of  constraints  active  at  the  solution,  then  the  solution  of  ICP 
would  also  be  the  solution  of  the  equality-constraint  problem  minF(x)  subject  to  Ax  = 
b.  This  suggests  applying  techniques  for  the  equality-constraint  case  to  ICP.  We  will 
obtain  the  solution  to  problem  ICP  by  solving  a  sequence  of  minimization  problems 
subject  to  linear  equality  constraints.  The  objective  function  (6.4.1)  remains  the  same 
in  all  of  these  subproblems,  but  the  constraint  matrix  is  modified  to  reflect  the  current 
assumptions  about  which  bounds  arc  satisfied  as  equalities  at  the  solution.  We  will  refer 
to  the  current  set  of  constraints  as  the  working  set  and  will  assume  that  they  correspond 
to  the  equation  Ax  =  b.  As  before,  Z  will  be  used  to  denote  a  matrix  satisfying  AZ  =  0. 

The  working  set  will  contain  a  subset  of  the  original  problem  constraints,  and  will 
attempt  to  predict  the  correct  active  set.  Since  the  prediction  of  the  active  set  could  be 
wrong,  an  active  set  method  must  also  include  procedures  for  testing  whether  the  correct 
prediction  is  correct  and  altering  it  if  not.  An  essential  feature  of  the  active  set  methods 
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considered  here  is  that  all  iterates  are  feasible. 

Following  the  development  in  Murtagh  and  Saunders  [1978],  the  matrix  A  is  parti¬ 
tioned  as 

A  =  {BSN)  (6.4.3) 

where  5  is  a  square  m  y.  m  non-singular  matrix,  TV  is  m  X  r,  and  5  is  m  X  (n  — 
m  —  r).  B  is  called  the  basis  matrix  and  its  columns  correspond  to  the  basic  variables. 
The  columns  of  TV  correspond  to  the  nonbasic  variables,  i.e.  those  variables  which  are 
equal  to  one  of  their  bounds.  The  columns  of  S  correspond  to  the  remaining  variables, 
which  are  called  superbasic.  The  number  of  superbasic  variables  indicates  the  number  of 
degrees  of  freedom  remaining  in  the  minimization.  In  the  important  special  case  of  linear 
programming  v/here  B  is  a  linear  function,  the  matrix  S  is  null.  With  the  partitioning 
(6.4.3)  of  the  matrix  A,  we  can  write  the  constraints  for  the  subproblem  in  the  form 

where  the  components  of  sire  taken  from  either  I  or  u,  depending  on  whether  the 
lower  or  upper  bound  is  binding. 

We  expand  the  function  T^  in  a  Taylor  series  about  some  feasible  point  *: 

F{x  -h  p)  =  F{x)  +  p(x)^P  +  iP^G'(x)p  4-  •  •  •.  (6.4.5) 


If  F[x)  were  a  quadratic  function,  then  G  would  be  a  constant  matrix,  and  there  would 
be  no  higher-order  terms  in  this  expansion.  In  this  case,  we  could  obtain  a  constrained 
stationary  point  at  i  +  p  by  insisting  that  . 


s ;  5)0-* 


(6.4.6) 


i.e.  the  step  remains  on  the  surface  given  by  the  intersection  of  the  active  constraints; 
also 


f9B\  (Vb\  (B"^ 
I  ffs  I  4-  gI  PS  I  =  I  5^ 

Kgs)  \pnJ 


vtv’’ 


;)(» 


(6.4.7) 


i.e.  the  gradient  at  x  -1-  p  is  expressible  as  a  linear  combination  of  the  active  constraint 
normals.  These  two  conditions  correspond  to  the  conditions  p  =  Zpz  and  (6.3.4)  in  the 
previous  section. 
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For  a  more  general  function  F{x),  the  step  p  may  not  lead  directly  to  a  stationary 
point,  but  (6.4.6)  and  (6.4.7)  can  be  used  to  determine  a  feasible  descent  direction.  From 
(6.4.6)  we  have  p^^  =  0  and  ps  —  — VFps,  where  W  =  B~^S.  Thus, 


Notice  the  correspondence  between  this  matrix  and  the  matrix  Z  computed  using  the 
variable- reduction  method  in  the  previous  section.  As  before,  we  do  not  explicitly 
compute  but  instead  compute  some  factorization  of  this  matrix.  Back-substitution 
is  then  used  to  compute  the  necessary  products  of  W  times  a  vector.  This  matrix  is 
indeed  orthogonal  to  the  working  set  matrix  A.  The  Lagrange  multipliers  (/z  X)^  can  be 
computed  using  the  equations 


=  gB  + {I  0 


and 

\  =  gff-N'^fx  +  {0  0  /)G^  /^p5. 

When  ps  =  0,  these  equations  reduce  to 


P  =  B 
X  =  p/v  - 

The  Lagrange  multipliers  can  be  used  to  modify  the  working  set  of  constraints.  For 
example,  suppose  that  bound  i  is  fixed  at  its  lower  endpoint,  i.e.  we  are  assuming  that 
Xi  =  If  the  Lagrange  multiplier  X,-  corresponding  to  this  bound  is  negative,  then 
the  objective  function  F  will  decrease  locally  if  Xi  is  allowed  to  increase  in  value.  This 
indicates  that  bound  i  could  be  dropped  from  the  working  set.  A  similar  situation  exists 
for  upper  bounds,  but  there  the  Lagrange  multiplier  should  be  positive  if  the  bound  is 
to  be  relaxed. 

Using  the  Taylor  series  (6.4.5)  and  equation  (6.4.6),  we  obtain  that  a  first-order 
condition  for  x  to  solve  ICP  is  that  Z^g{x*)  =  0.  This  is  the  same  condition  as  for  the 
equality-constraint  problem.  We  are  assuming,  of  course,  that  x*  is  a  feasible  point  and 
that  Ax*  =  b. 

The  Taylor  expansion  (6.4.5)  also  leads  us  to  a  second-order  condition  for  a  solution 
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to  ICP.  Because  the  projected  gradient  is  zero,  we  obtain  directly  that 


G*  =  Z'^GZ 


must  be  positive  semi-definite  at  x* . 

Sufficient  conditions  for  a  minimum  are  slightly  more  complex,  since  the  Lagrange 
multiplier  for  an  active  constraint  may  be  zero.  Also,  complications  can  arise  because  a 
variable  could  be  fixed  at  either  a  lower  or  an  upper  bound.  For  simplicity,  we  assume 
that  all  active  bounds  are  lower  bounds.  Results  for  upper  bounds  are  obtained  by 
changing  >  to  <  below.  Keeping  this  in  mind,  we  find  that  x*  is  a  solution  to  ICP  if: 

(i)  Ax*  =  b,  I  <  X*  <  u  and  =  b. 

(ii)  Z'^g{x*)  =  0  (where  AZ  =  0). 

(iii)  X  >  0  (where  X  is  obtained  from  (6.4.7)). 

(iv)  Z'^G{x*)Z  is  positive  definite. 

(v)  If  X,-  =  0,  then  p^G[x*)p  >  0  for  all  p  such  that  pjv<  >  0. 

Assuming  that  an  initial  feasible  point  is  available,  the  general  structure  of  an 
working-set  algorithm  can  be  summarized  as  follows: 

(6.4.6)  Working-set  Algorithm 

Wl.  Let  be  the  current  point.  We  assume  that  is  feasible  and 
that  A  is  the  matrix  of  constraints  active  at  x^*'^ 

W2.  Check  if  x^*^  is  the  solution  of  the  equality-constraint  problem.  If 
not,  go  to  step  W5. 

W3.  Calculate  X  =  gfq  —  If  X  satisfies  the  second-order  sufficient 
conditions  for  a  minimum,  then  x*  is  the  solution  to  ICP.  Terminate  the 
algorithm. 

W4.  If  X,-  <0  for  some  variable  x^.  at  its  lower  bound  (or  X  >  0 
for  some  x^.  at  its  upper  bound),  compute  a  direction  p^^^  such  that 
gW'^pW  <  0,  pW  >  0  (p^J  <  0),  and  pJJ]  =  0  for  i  ^  j.  For  such  a 
p(*\  the  t-th  bound  becomes  inactive  and  is  deleted  from  the  active  set. 

Go  to  step  W6. 

W5.  (</(*')  511^  AX)  Construct  a  direction  p^*^  such  that  A^p^*)  =  0, 

gW  pW  <  0  (i.e.  a  descent  direction  for  the  equality-constraint  problem). 

This  can  be  done  by  solving  the  projected-Ncwton  equations  for  the 
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equality-constraint  problem. 

W6.  (line  search)  Normally,  the  line  search  would  be  based  solely  on 
a  “sufficient  decrease”  in  the  objective  function  F.  In  this  context,  it  is 
possible  to  run  into  a  formerly  inactive  bound  while  searching  along 
If  a  new  constraint  is  encountered  during  the  linesearch,  it  is  then  added 
to  the  active  set.  Go  to  step  Wl. 

Because  there  is  possibly  some  choice  in  step  W4  as  to  which  constraint  to  drop 
from  the  active  set,  various  active-set  strategies  have  been  suggested  for  solving  ICP.  For 
this  thesis,  where  we  are  concerned  with  the  application  of  truncated-Newton  methods, 
the  details  of  the  strategy  are  not  important. 

6.4.2.  The  Application  of  Truncated-Newton  Methods  to  Inequality- 
Constraint  Problems 

Since  the  algorithm  described  in  the  previous  section  computes  a  search  direction  p 
by  solving  a  set  of  projected-Newton  equations,  it  may  appear  that  truncated-Newton 
methods  can  be  applied  directly  for  the  solution  of  ICP.  If  no  preconditioning  is  used, 
this  is  indeed  true.  The  construction  of  p^*^  in  step  W5  is  then  a  local  problem  involving 
the  (approximate)  solution  of  a  set  of  linear  equations. 

However,  when  preconditioning  is  a  part  of  the  algorithm,  certain  complications 
arise.  With  equality  constraints,  the  projection  matrix  Z  remains  constant;  but  with 
inequality  constraints,  the  projection  matrix,  and  hence  the  structure  and  size  of  the 
projected-Newton  equations  (6.3.5),  can  change  from  iteration  to  iteration  as  the  active 
set  changes.  In  this  section,  we  describe  how  to  modify  the  preconditioning  matrix  to 
reflect  these  changes. 

In  order  to  simplify  the  discussion,  we  will  assume  (without  loss  of  generality) 
that  bounds  are  added  or  deleted  one  at  a  time.  We  will  also  assume  that  a  diagonal 
preconditioning  is  being  used.  More  complex  preconditionings  can  be  used,  and  it  is 
straightforward  to  adapt  the  following  discussion  to  the  more  general  case.  In  fact, 
the  ideas  here  are  based  on  the  presentation  in  Gill  and  Murray  [1973b]  where  a  (full) 
quasi-Newton  approximation  to  the  Hessian  is  being  modified. 

Deleting  a  bound  corresponds  to  deleting  a  column  (say  the  last)  from  A.  This 
implies  that  a  column  must  be  added  to  Z: 
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Z  =  iZ\  z). 


Then 

^  ^■‘-\lTCZ  zTGz)' 

(For  large-scale  problems,  this  matrix  would  never  be  explicitly  computed;  we  use  it 
here  only  as  a  theoretical  tool.)  Let  D  be  our  diagonal  preconditioning  corresponding  to 
Z'^GZ.  A  natural  choice  for  D,  the  new  preconditioning,  is  thus 

where  a  =  z'^Gz  (or  a  =  1  if  this  quantity  is  expensive  to  compute). 

Adding  a  bound  is  a  slightly  more  difficult  problem.  This  corresponds  to  reducing 
the  size  of  Z  by  one  column.  If  we  are  deleting  the  g-th  superbasic  variable,  then  we 
just  delete  the  g-th  diagonal  element  of  D  to  obtain  D.  Deletion  of  a  basic  variable 
can  be  achieved  by  interchanging  the  basic  variable  with  a  superbasic  variable,  and  then 
deleting  the  new  superbasic  column  as  indicated. 

The  interchange  of  the  p-th  basic  variable  with  the  g-th  superbasic  variable  can  be 
described  by  an  equation  of  the  form 

Z  =  Z{I  + 

where  e,  is  the  g-th  unit  vector  and  v  is  defined  by  the  equations 

fl’Wp  =  Cp, 

y  =  5%, 

y,  = 

V  —  — (y  -I-  e,). 

Vq 

These  quantities  are  easily  computed. 

We  would  now  like  to  approximate  the  diagonal  D  of 

Z'^GZ  =  (/  +  e,v^)’Z^G'Z(/  +  e,v^) 

given  Z  and  an  approximation  D  to  the  diagonal  of  Z'^GZ.  The  formula  for  the  new 
diagonal  element  dt  is 

3,  =  di  -b  2viz'^Gzi  -1-  z’^Gz^v'i, 
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where  is  the  t-th  column  of  Z.  Unless  the  values  {z'^Gzi}  are  inexpensive  to  compute, 
updating  D  in  this  manner  will  not  be  feasible.  As  a  result,  we  suggest  simply  deleting  the 
g-th  column  of  D;  there  is  little  justification  for  applying  the  transformation  (/  +  e,v^) 
directly  to  D. 

Further  details  concerning  the  treatment  of  constraint  matrices  for  large  problems 
can  be  found  in  Murtagh  and  Saunders  [1978]. 

6.5.  Least-Squares  Problems 

Least-squares  problems  are  concerned  with  finding  a  point  x*  which  minimizes  the 
sum  of  squares  of  nonlinear  functions 

m 

*  e  5R”,  m  >  n.  (6.5.1) 

1=1 

Such  problems  can  be  solved  using  the  minimization  algorithms  described  in  the  previous 
chapters,  but  the  special  form  of  the  function  F  suggests  the  use  of  more  specialized 
techniques. 

The  gradient  vector  g{x)  and  Hessian  matrix  G{a:)  of  F(x)  are  given  by  2J(x)^/(x) 
and  2(J(x)^J(x)  +  B(x))  respectively,  where  J(x)  is  the  m  X  n  Jacobian  matrix  of  /(x) 
whose  i-th  row  is  V/,(a:)  =  (a/,/5ii,a/./ai2,...,9/,/ax„),  B(x)  —  EJli /.WCiC®) 
and  Gi(x)  is  the  Hessian  matrix  of  fi(x).  [F{x)  is  assumed  to  be  twice  continuously 
differentiable,  although  the  methods  discussed  in  this  section  will  often  work  when  this 
condition  does  not  hold.)  The  restriction  that  m  is  greater  than  or  equal  to  n  serves  only 
to  simplify  the  notation. 

If  Newton’s  method  is  applied  to  the  solution  of  (6.5.1),  the  special  form  of  the 
Hessian  matrix  and  gradient  vector  leads  to  the  following  set  of  linear  equations  for  the 
Newton  direction 

( J(xW)^J(i(‘))  +  H(i(‘')))p^^  =  -  J(xW)^/(xW).  (6.5.2) 

The  Gauss-Newton  method  was  the  first  designed  to  exploit  the  special  structure 
of  the  Hessian  matrix  and  gradient  vector  which  occurs  in  least-squares  problems.  The 
method  computes  the  direction  of  search  as  the  solution  to 

J(x(^))^J(x{*=))pgJ^  =  -  J(x(''))^/(x(*)).  (6.5.3) 
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These  equations  are  obtained  by  neglecting  the  second-derivative  matrix  in  (6.5.2). 

The  Gauss-Newton  method  is  intended  for  problems  where  ||5(x)||  is  small  compared  to 
||J(x)^J(x)||,  such  as  the  so-called  “small-residual  problem”  where  f{x)  0  as  x  — »•  x*. 
For  these  problems  the  Gauss-Newton  method  will  ultimately  converge  at  the  same  rate 
as  Newton’s  method,  despite  the  fact  that  only  first-derivative  information  is  used.  We 
will  concentrate  on  that  case  here. 

Truncated-Newton  methods  can  be  applied  directly  to  the  solution  of  the  equations 
(6.5.3)  (or  even  (6.5.2)  if  the  second  derivative  information  in  B{x)  is  available).  The  fact 
that  the  Hessian  matrix  and  gradient  are  of  a  special  form  is  irrelevant  to  the  truncated- 
Newton  algorithm. 

If  we  assume  that  the  matrix  J  is  of  full  rank,  then  the  system  of  equations  (6.5.3) 
will  have  a  positive-definite  coefficient  matrix.  Thus,  unlike  when  we  were  solving  more 
general  optimization  problems,  it  is  possible  to  use  the  regular  linear  conjugate-gradient 
algorithm  to  approximately  solve  (6.5.3).  It  is  possible  to  use  the  algorithm  described  in 
section  2.4,  if  we  set  A  =  J'^J .  However,  because  of  the  factored  form  of  the  coefficient 
matrix  in  (6.5.3),  and  because  we  would  like  to  precondition  the  linear  algorithm,  the 
following  set  of  formulas  is  to  be  preferred: 

Given  po.  Set  sq  =  f  —  Jpo,  tq  =  J'^sq. 

For  q  =  0, 1, . . . 

2q  =  M~^rq 

Uq  =  Zq  +  PqUq—l 

/3o=0 

Vq  —  JUq 
Xq^\  — —  Xq  4" 

a,  =  z’^rqfv^Vq 
Tg-i-l  =  J  Sf.fl 

Next  q. 

Here,  M  is  some  approximation  to  J'^J, 

■  The  only  remaining  problem  is  how  to  generate  the  preconditioning  matrix  M.  When 
the  matrix  J  is  available,  then  we  would  always  use  diag{j^j}.  Otherwise,  we  could 
use  one  of  the  diagonal  preconditionings  described  in  Chapter  5.  They  require  a  pair 
of  vectors  (u,Gu),  or  in  this  case  {u,j’^Ju).  When  the  linear  algorithm  is  programmed 
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using  the  formulas  above,  the  vector  J^Ju  is  not  a  natural  by-product  of  the  algorithm. 
However,  if  we  compute  the  difference  between  the  successive  residuals 

^9+1  ~^q~  *^^*9+1  ~  *9) 

=  —aqj'^JUq, 

% 

we  are  able  to  obtain  the  desired  vector. 

When  the  derivatives  of  the  functions  /,•  are  available,  it  is  also  possible  to  solve  the 
full  Newton  equations  (6.5.2)  using  a  linear  conjugate-gradient  algorithm.  If  the  matrix 
J  were  sparse,  then  this  algorithm  could  be  preconditioned  using  J^J.  Otherwise,  the 
diagonal  of  J'^J  or  an  approximation  to  it  could  be  used. 
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7  Numerical  Results 

7.1.  Introduction 

In  this  Chapter  we  discuss  the  numerical  behavior  of  several  of  the  methods  dis¬ 
cussed  earlier.  It  was  not  feasible  to  test  every  combination  of  techniques  that  has  been 
described,  but  we  have  attempted  to  ascertain  through  selective  testing  the  most  promis¬ 
ing  version  of  a  truncated-Newton  algorithm  for  general  usage.  The  method  used  to 
compare  algorithms  consists  of  applying  them  to  a  set  of  test  problems.  We  do  not  claim 
that  this  is  a  completely  satisfactory  means  of  comparison,  but  we  believe  that,  if  the 
test  problems  are  selected  carefully,  the  evidence  obtained  can  be  a  valuable  aid  in  the 
selection  of  the  best  algorithm. 

This  method  of  testing  has  many  drawbacks.  One  difficulty  is  the  volume  of  data 
that  subsequently  needs  analyzing.  We  have  displayed  the  raw  data  together  with  an 
aggregation  of  the  results.  Too  much  emphasis,  however,  should  not  be  placed  on  the 
aggregated  numbers  since  they  are  unduly  weighted  by  the  more  difficult  problems.  An 
alternative  form  of  display  is  to  enter  as  1  the  best  result  and  have  all  other  entries  be 
their  multiple  of  this  result.  The  drawback  to  this  method  is  in  our  view  more  serious 
since  greater  emphasis  is  then  placed  on  problems  that  are  easily  solved. 

The  popularity  and  success  of  battery  testing  is  largely  due  to  the  fact  that  for  many 
algorithms  the  differences  in  results  are  so  large  as  to  leave  little  doubt  as  to  the  correct 
conclusion.  It  is  also  a  useful  technique  for  demonstrating  that  an  algorithm  is  poor.  The 
converse,  however,  is  not  always  true.  If  an  algorithm  fails  where  others  easily  succeed  it 
demonstrates  a  flaw  in  that  algorithm.  If  an  algorithm  is  simply  a  little  slower  or  faster 
then  this  could  merely  be  due  to  the  luck  of  the  draw. 


7.2  The  assessment  criterion 

All  optimization  software  requires  a  criterion  for  terminating  the  computation  of  the 
sequence  Ideally,  if  we  wish  to  measure  the  comparative  efficiency  of  routines  we 

should  set  the  same  termination  criterion  in  all  the  routines  tested  and  then  compute 
the  cost  of  a  minimization,  in  terms  of  the  number  of  function  evaluations  for  instance. 
However,  there  is  no  universal  agreement  on  what  is  the  best  termination  criterion  and 
a  different  criterion  used  by  another  researcher  may  result  in  a  wide  variation  in  the 
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accuracy  of  the  answer  obtained.  The  question  remains,  therefore,  as  to  the  point  at 
which  we  should  assess  the  efficiency  of  the  various  methods.  The  assessment  criterion 
used  here  is  to  take  the  first  point  for  which 

<  t(1  +  |F(x*)|),  (7.2.1) 

where  t  is  a  scalar.  Some  authors  have  argued  against  the  use  of  (7.2.1)  because  it  includes 
F[x  ),  which  is  unknown  on  real  problems.  We  believe  that  such  authors  are  confusing  an 
assessment  criterion,  where  the  use  of  F{x*)  is  legitimate,  with  a  termination  criterion, 
where  it  is  not. 

If  the  criterion  (7.2.1)  is  to  give  a  realistic  assessment  of  the  performance  of  an 
algorithm,  the  choice  of  r  must  give  a  point  which  is  close  to  a  final  estimate  of  x* 
obtained  with  a  realistic  termination  criterion.  The  relative  performance  of  algorithms 
with  superlinear  convergence  is  almost  invariant  with  the  choice  of  t  and  a  very  small 
value  can  be  used.  For  example,  on  an  IBM  370/168,  where  the  function  can  be  computed 
to  approximately  fifteen  decimal  places  in  double  precision,  a  reasonable  choice  of  t  is 
10~^°.  However,  for  conjugate-gradient  type  methods,  which  exhibit  a  linear  rate  of 
convergence,  the  performance  can  vary  widely  with  the  choice  of  r.  It  is  not  unusual 
for  the  number  of  function  evaluations  to  be  three  times  greater  for  t  =  10“^°  than  for 
T  =  10“®.  In  this  case  it  is  important  that  a  moderate  termination  criterion  be  used.  In 
all  the  tests  carried  out  for  this  study,  t  was  set  at  10“®. 


7.3  The  algorithms  tested 

The  results  of  this  chapter,  in  addition  to  exhibiting  the  performance  of  a  variety  of 
truncated-Newton  algorithms,  illustrate  the  numerical  behaviour  of  three  algorithms  for 
general  unconstrained  minimization.  These  are: 

1.  Algorithm  PLMA 

Diagonally  preconditioned  two-step  BFGS  formula  with  accumulated  step  (see  Gill 
and  Murray  [1979]). 

2.  Algorithm  MNM 

A  modified  Newton  method  using  first  and  second  derivatives  (see  Gill  and  Murray 
[1974a]). 
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3.  Algorithm  QNM 


A  quasi-Newton  method  using  the  full  n  X  n  BFGS  update  of  the  approximate 
Hessian  Matrix  (see  Gill  and  Murray  [1972]). 

The  use  of  these  accepted  and  widely-tested  algorithms  gives  us  an  objective  test  of  the 
overall  effectiveness  of  our  truncated-Newton  methods. 


7.4  The  test  examples 

The  provision  of  suitable  test  problems  is  extremely  difficult.  Problems  that  are 
used  to  measure  the  efficiency  of  algorithms  for  small  dense  problems  are  completely 
unsatisfactory  since  the  algorithms  considered  here  are  intended  mainly  for  large-scale 
problems.  For  example,  it  is  pointless  to  test  a  truncated-Ncwton  method  on  a  very  small 
problem  since  the  algorithm  will  be  effectively  performing  a  full  Newton  iteration. 

A  serious  difficulty  with  using  very  large  test  problems  is  that,  for  all  but  the  most 
trivial  examples,  the  CPU  time  necessary  to  compute  the  objective  function  will  be  very 
large.  This  is  typically  the  case  if  we  attempt  to  use  real-world  problems  for  testing 
purposes.  Moreover,  it  is  desirable  that  problems  be  defined  in  such  a  way  that  they 
may  be  used  by  other  researchers.  Large-scale  real-world  problems  almost  invariably  are 
written  in  a  non-portable  form  or  can  be  run  only  with  vast  quantities  of  numerical  data. 

In  this  study  we  have  attempted  to  compromise  on  these  issues  by  collecting  a  set  of 
non-trivial  problems  that  can  be  run  with  moderate  ease  at  other  installations.  Eighteen 
problems  are  considered.  Of  these,  16  problems  are  of  dimension  50  or  greater  and  7 
problems  are  of  dimension  100.  It  is  necessary  to  present  an  extensive  number  of  results 
because  the  performance  of  conjugate-gradient-type  methods  is  generally  erratic.  If  we 
are  to  identify  which  strategy  gives  a  true  improvement  in  performance,  a  wide  spectrum 
of  results  must  be  considered. 

The  test  examples  may  be  separated  into  two  classes.  The  first  class  contains 
problems  whose  Hessian  matrix  at  the  solution  has  clustered  eigenvalues;  the  second 
contains  problems  whose  Hessian  matrix  has  an  arbitrary  eigenvalue  distribution. 
Example  1.  Penl  (Gill,  Murray,  and  Pitfield  [1972]) 


F(x}  =  a  -1)^  -hb 

t=i 


78 


The  solution  varies  with  n,  but  a:,-  =  x*:,.!,  t  =  1.  All  the  runs  made 

were  with  a  =  1,  6  =  10~®.  With  these  values,  the  Hessian  matrix  at  the  solution 
has  n  —  1  eigenvalues  0(1)  and  one  eigenvalue  0(10“®).  The  Hessian  matrix  is  full 
and  consequently,  for  large  values  of  n,  conjugate-gradient  type  methods  are  the  only 
techniques  available. 

Example  2.  Pen2  (Gill,  Murray,  and  Pitfield  [1972]) 


where  Cj  =  e’OO  +  e(»-i)/io  i  =  2, . . . ,n.  The  solution  varies  with  n,  but  x,-  =  x»+i 
for  t  =  1, . . . ,  n  —  1.  This  example  was  also  run  with  0  =  1  and  b  =  10“®.  For  these 
values  the  Hessian  matrix  at  the  solution  has  n  —  2  eigenvalues  0(1)  and  two  eigenvalues 
0(10“®).  The  Hessian  matrix  is  full. 

Example  3.  Pen3  (Gill,  Murray,  and  Pitfield  [1972]) 


{n-2 

1  +  e*"  +  2x,+i  +  10X.+2  -  1)^ 

+  (E  (*•  +  2X.+1  -f-  10x,+2  -  1)’')(E  (2a;.  +  a:.+i  - 

n-2  ^ 

+  e*—'  ^  (2x<  +  Xi+1  -  3)*| 

/  n  X  2  n/2 

\=l  ■'  i=i 


At  the  minimum,  this  function  has  n/2  eigenvalues  0(1)  and  n/2  eigenvalues  0(10“®). 
The  Hessian  matrix  is  full. 

The  remaining  examples  have  arbitrary  distributions  of  eigenvalues  at  the  solution. 
Example  4.  Chebyquad  (Fletcher  [1965]) 


where 


^(®)  =  E  •^•(*)^ 

»=i 

I  ±T:M.  i=l . . 


y=i 
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and  ^^(a:)  is  the  i^^-order  shifted  Chebyshev  polynomial.  The  Hessian  matrix  is  full. 
Example  5.  GenRose 

This  function  is  a  generalization  of  the  well-known  two-dimensional  Rosenbrock 
function  (Rosenbrock  [I960]). 

F{x)  =  1-1-  ^(l00(xi  -  Xi_if  -1-  (1  -  X,)*). 

»=2 

Our  implementation  of  this  function  differs  from  most  others  in  that  F{x)  is  unity  at 
the  solution  rather  than  zero.  This  modification  ensures  that  the  function  cannot  be 
computed  with  unusually  high  accuracy  at  the  solution  and  is  therefore  more  typical  of 
practical  problems.  °  ] 

The  next  three  examples  arise  from  the  discretization  of  problems  in  the  calculus  of 
variations.  Similar  problems  arise  in  the  numerical  solution  of  optimal  control  problems. 
The  general  continuous  problem  is  to  find  the  minimum  of  the  functional 

J (x(<))  =  /  /(<,  x{t),  x'(f))  dt, 

J  0 

over  the  set  of  piecewise  differentiable  curves  with  the  boundary  conditions  x(0)  =  a, 
a;(l)  =  h.  If  x{t)  is  expressed  as  a  linear  sum  of  functions  that  span  the  space  of  piecewise 
cubic  polynomials  then  minimization  of  J  becomes  a  finite-dimensional  problem  with  a 
tri-diagonal  Hessian  matrix. 

Example  6.  Call  (Gill  and  Murray  [1973a]) 

J(x{t))  =  J  |x(<)^  -I-  x'{t)  tan~^  x'(t)  —  log(l  +  x'(f)^)*  |  dt, 

with  the  boundary  conditions  x(0)  =  1,  i(l)  =  2. 

Example  7.  Cal2  (Gill  and  Murray  [1973a]) 

j{x{t))  =  J  |l00(x(t)  —  x'(f)’*)^ -1- (l  —  i'(<))**|df, 

with  the  boundary  conditions  x(0)  =  x(l)  =  0. 

Example  8.  Cal3  (Gill  and  Murray  [1973a]) 
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with  the  boundary  conditions  i(0)  =  1,  x{l)  =  0. 

Example  9.  QOR  (Toint  [1978]) 

50  33  f 

F(x)=  +  -  E  E 

i=i  t=i  yeB(») 

where  the  constants  a,-,  Pi,  di  and  sets  A(i)  and  B[i)  are  described  in  Toint’s  paper.  This 
function  is  convex  with  a  sparse  Hessian  matrix. 

Example  10.  GOR  (Toint  [1978]) 

50  33 

=  E  E 

t=i  t=i 

where 

O',!,  log^(l  +  Xi),  Xi  >  0, 

-UiXi  log  Jl  +  a:,),  <  0, 

Vi  =di-  Xj  +  Y 

i€A(i)  jeB(0 

and 

A2/iloge(l +  »,),  y,- >  0, 

PiVi,  Vi  <  0- 

The  constants  ai,  Pi,  di  and  sets  A(i)  and  B{i)  are  deiined  as  in  Example  QOR.  This 
function  is  convex  but  there  are  discontinuities  in  the  second  derivatives. 

Example  11,  ChnRose  (Toint  [1978]) 

25 

F{x)  =  1  +  E  (4«»(a:.-i  -  +  (1  “  Xif), 

i=2 

where  the  constants  a,  are  those  used  in  the  example  QOR.  The  value  of  F[x)  at  the 
solution  has  been  modified  as  in  Example  5.  The  Hessian  matrix  is  tri-diagonal. 

7.5  Starting  points 

The  starting  points  used  were  the  following. 

Start  1 

x(0)  =  (0,0,...,0)’’. 
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Start  2 


Start  3 


1  2 

n  +  l’n+l’”"’ 


n 

n  + 1 


=(1,-1, 1,-1,...  f. 

Start  4 

x(°)=(-l,-l,...,-lf. 


7.6  Description  of  the  tests 

All  the  algorithms  are  coded  in  double-precision  Fortran  IV.  The  runs  were  made  on 
an  IBM  370/168,  for  which  the  relative  machine  precision,  £,  is  approximately  10“^®. 

Each  algorithm  requires  two  additional  user-specified  parameters:  X,  the  bound  upon 
the  change  in  x  at  each  iteration,  (the  quantity  ||x(*+^)  -  x('')||2)  and  Fejt,  an  estimate 
of  the  value  of  the  objective  function  at  the  solution.  For  all  problems,  the  value  of  X 
was  set  at  10,  and  Feat  was  set  to  the  value  of  F{x)  at  the  solution. 

For  the  initial  testing,  a  limited  set  of  test  functions  was  used.  This  set  includes:  Penl 
{N  —  50,  Start  3),  GenRose  (iV  =  50,  Start  2),  Call  (jV  =  50,  Start  1),  and  Chebyquad 
(N  —  20,  Start  2).  These  four  functions  have  quite  different  behavior,  and  it  was  found 
that  performance  on  these  test  functions  was  often  indicative  of  the  performance  of  an 
algorithm  on  the  complete  battery  of  test  functions.  For  these  limited  tests,  only  the 
value  rj  =  .25  was  used  and  an  time  limit  of  15  seconds  was  placed  on  each  test  run. 

In  order  to  determine  a  “good”  truncated-Newton  algorithm  and  also  to  compare  the 
performance  of  this  good  truncated-Newton  algorithm  against  better-known  algorithms, 
more  complete  tests  were  carried  out.  The  complete  set  of  test  functions  was  used,  and 
the  values  t]  =  .25,  .1,  .001  were  tried.  Many  of  these  numerical  results  were  obtained  by 
Gill  and  Murray  [1979].  Since  the  optimal  value  of  ij  is  often  larger  for  algorithms  which 
use  second-derivative  information,  a  series  of  tests  with  rj  =  .5,  .7,  .9.  was  also  carried 
out.  Finally,  a  special  set  of  comparisons  against  Newton’s  method  was  done. 

The  full  set  of  results  is  contained  in  the  tables  in  the  appendix.  Each  entry  in  a 
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tabic  consists  of  a  pair  of  values:  the  first  is  the  number  of  non-linear  iterations  required 
to  find  the  solution,  the  second  is  the  number  of  function/gradient  evaluations  (unless 
otherwise  indicated,  this  number  includes  the  function  evaluations  in  the  linesearch  as 
well  as  those  used  to  compute  the  matrix-vector  products  in  the  linear  sub-algorithm).  A 
number  of  the  test  functions  have  sparse  Hessian  matrices,  and  in  these  cases  it  is  possible 
to  use  sparse  finite-differencing  to  compute  these  matrices  at  the  beginning  of  each  non¬ 
linear  iteration.  A  lower-case  “s”  at  the  end  of  a  function  name  (for  example,  GenRs) 
indicates  that  sparse  finite-differencing  is  being  used.  (The  routines  for  computing  the 
sparse  Hessian  matrices  were  developed  by  Thapa  [1980].) 


7.7  Discussion  of  results 

The  first  tests  were  used  to  determine  the  better  preconditioning  strategies.  These 
are  summarized  in  table  1.  The  routine  used  was  a  preconditioned  Lanczos  algorithm 
with  forcing  function  (4.3.6)  and  the  standard  forcing  sequence  (4.4.3).  PLMA  was  the 
non-linear  outer  algorithm.  The  terms  used  to  describe  the  preconditioning  strategies 
correspond  to  section  8.2.3;  the  letters  DNC  indicate  that  the  algorithm  did  not  converge 
in  15  CPU  seconds. 

The  results  indicate  that  the  diagonal  preconditionings  are  the  most  effective.  The 
exact  diagonal  of  the  Hessian  often  performs  very  well;  the  only  exception  is  for  the 
function  GenR,  where  the  Hessian  matrix  is  frequently  indefinite.  A  better  strategy  for 
handling  negative  diagonal  elements  might  improve  the  result  in  this  case  (for  these  runs, 
negative  diagonal  elements  were  replaced  by  their  absolute  value). 

Tables  2  and  3  show  the  effects  of  different  forcing  sequences.  The  exact  BFGS 
diagonal  preconditioning  was  used  in  combination  with  the  routine  described  for  table 
1.  Table  2  was  made  using  the  forcing  function  (4.3.6)  and  table  3  with  function  (4.3.5). 
The  tests  in  table  2  follow  naturally  from  the  discussion  in  section  4.4;  those  in  table  3 
were  made  because  the  standard  forcing  sequence  {<j>k  ==  min  (l/fc,  ||ff^*'^||})  was  found  to 
be  too  stringent  in  combination  with  function  (4.3.5). 

Our  experience  with  alternative  forcing  sequences  has  been  inconclusive;  the  tests  in 
Table  2  were  included  to  show  how  well  they  sometimes  performed  on  specific  functions. 
The  standard  forcing  sequence  is  quite  effective  for  the  class  of  problems  chosen.  Table 
3  shows  that  scaling  this  forcing  sequence  by  1.5  is  worthwhile  when  function  (4.3.5)  is 
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being  used. 

Tables  4A-4E  were  used  to  choose  the  optimal  truncated-Newton  routine.  The  three 
approximate  diagonal  preconditionings  were  used  on  all  test  problems  with  rf  =  .25  in 
combination  with  the  following  routines: 

1.  TNI — a  preconditioned  Lanczos  algorithm  with  the  standard  forc¬ 
ing  sequence  and  forcing  function  (4.2.2), 

2.  TN2 — a  preconditioned  Lanczos  algorithm  with  the  standard  forc¬ 
ing  sequence  scaled  by  1.5  and  forcing  function  (4.3.5), 

3.  TN3 — as  in  TNI,  but  with  forcing  function  (4.3.6), 

4.  MINR — a  preconditioned  MINRES  algorithm. 

All  routines  use  PLMA  as  the  non-linear  outer  algorithm.  The  numbering  of  the  precon¬ 
ditionings  corresponds  to  the  list  in  section  8.2.3. 

The  totals  from  all  the  runs  are  listed  in  table  4E.  The  best  routine  could  be  decided 
upon  in  a  number  of  ways: 

1.  iteration  count 

2.  function  evaluations  (regular) 

3.  function  evaluations  (sparse) 

4.  function  evaluations  (total) 

5.  function  evaluations  (regular)  plus  iteration  count 

6.  function  evaluations  (sparse)  plus  iteration  count 

7.  function  evaluations  (total)  plus  iteration  count 

With  the  exception  of  criteria  2  and  5  where  TN2  PC=1  is  best,  the  totals  indicate  that 
TNI  PC=3  is  the  optimal  routine. 

In  order  to  ascertain  the  overall  effectiveness  of  truncated-Ncwton  algorithms, 
routine  TNI  PC=3  (hereafter  referred  to  simply  as  TN)  was  compared  with  PLMA, 
MNM,  and  QNM  on  the  full  set  of  test  functions  for  jj  =  .25,  .1,  .001.  In  addition,  a  very 
simple  truncated-Newton  algorithm  was  examined  both  with  and  without  an  exact  BFGS 
diagonal  preconditioning  (see  section  8.3.2).  These  routines  are  referred  to  as  PBTN  and 
BTN,  respectively  (the  initials  P  and  B  stand  for  “preconditioned”  and  “basic”).  The 
results  of  these  tests  can  be  .found  in  tables  5A-5E;  NR  indicates  that  a  test  was  not  run, 
and  NA  that  a  total  is  not  available. 

As  table  5E  indicates,  TN  only  requires  50%  to  80%  as  many  function  evaluations 
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as  PLMA  to  solve  the  full  set  of  test  problems.  This  seems  a  significant  reduction.  When 
sparse  finite-differencing  is  used,  both  PBTN  and  BTN  can  compare  favorably  with  TN. 
Without  this  feature,  however,  they  are  considerably  slower.  This  is  especially  true  of 
the  unpreconditioned  routine  BTN;  a  major  factor  is  the  performance  of  this  routine  on 
the  problem  Call  N  =  100. 

The  set  of  tests  summarized  in  tables  6A  and  6B  shows  the  performance  of  routine 
TN  with  different  values  of  ij.  Regardless  of  which  performance  measure  is  used,  .25 
is  always  the  overall  optimal  value  of  17  for  this  routine.  The  results  were  insensitive 
to  the  various  choices  of  rj.  This  is  in  marked  contrast  to  PLMA.  The  main  reason  for 
this  insensitivity  was  that  the  initial  step  was  close  to  the  minimum  along  the  search 
direction. 

The  final  set  of  tests,  summarized  in  tables  7A  and  7B,  are  a  special  comparison  of  the 
truncated-Newton  and  modificd-Newton  algorithms.  When  TN  and  MNA  were  compared 
in  the  tables  5,  the  work  required  to  compute  second-derivative  information  was  ignored 
for  MNA  but  counted  for  TN,  even  though  TN  only  computes  partial  second-derivative 
information  whereas  MNA  computes  the  full  Hessian  matrix.  In  the  tables  7,  these 
two  routines  are  evaluated  in  a  fairer  way.  It  is  assumed  that,  at  the  beginning  of  each 
nonlinear  iteration,  the  full  Hessian  is  evaluated  and  then  used  either  to  solve  the  Newton 
equations  (in  the  case  of  MNA)  or  to  compute  the  necessary  matrix/vector  products  (for 
TN).  As  a  result,  in  the  number  pairs  in  the  tables,  the  first  number  indicates  the 
number  of  Hessian  matrices  computed,  and  the  second  the  number  of  function/gradient 
evaluations  used  in  the  linear  search.  As  the  totals  indicate,  the  truncated-Newton 
algorithm  requires  fewer  Hessian  matrices  as  well  as  fewer  function/gradient  evaluations. 
In  fact,  TN  is  almost  twice  as  efficient  as  MNA.  This  is  especially  surprising  since  TN  is  a 
routine  designed  for  large-scale  function  minimization  and  not  for  general  optimization, 
like  MNA. 

7.8  A  supplementary  test  problem 

•  Up  until  this  point,  all  of  the  algorithms  have  been  tested  on  a  particular  set  of 
test  problems.  This  raises  the  question  of  whether  a  good  truncated-Newton  algorithm 
has  been  found  or  whether  we  have  just  found  the  optimal  algorithm  for  this  set  of 
test  functions.  In  addition,  for  practical  reasons  we  have  limited  ourselves  to  relatively 
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small  test  functions  (n  <  100).  For  this  reason,  we  now  test  algorithm  TN  on  a  larger, 
independent  test  example. 

The  function  is  taken  from  Murtagh  and  Saunders  [1980].  It  investigates  the  optimal 
control  of  a  spring,  mass,  and  damper  system.  In  its  original  form,  the  problem  has  a 
quadratic  olijective  function  and  a  set  of  equality  and  inequality  constraints: 

1 

^  t=o 

subject  to 

Xt+i  =  xt  +  0.2yt 

Vt+i  =  yt  —  O.Olj/t  —  O.OOlit  +  0.2«t 

-0.2  <ut<  0.2 

yt  >  -1.0 

for  i  =  0, . . . ,  r  —  1,  and 


xq  —  10,  Vo  =  0,  yr  =  0. 

The  starting  point  used  was  it  =  0,  y*  =  — 1  (t  =  0, . . . ,  T),  and  Ut  =  0  {t  —  0, . . .  ,T  — 
1).  For  these  tests,  T  —  100,  and  so  there  are  302  variables  in  all. 

Since  this  is  a  constrained  optimization  problem,  and  algorithm  TN  is  only  designed 
to  solve  unconstrained  problems,  we  minimize  a  related  penalty  function: 

F{x,y,u)  =  pf{x,y,u)  +  c'^c. 

Here,  c  is  a  vector  with  one  component  for  each  constraint  above.  For  example,  ci  = 
Xi  —  Xq  —  0.22/0*  If  c,-  is  a  component  corresponding  to  an  inequality  constraint  such  as 
2/t  >  —1.0,  then  c,-  =  2/t  +  1-0  if  yt  <  —1.0,  and  c,-  =  0.0  otherwise.  The  parameter  p  is 
a  penalty  coefficient;  the  smaller  its  value,  the  more  stringently  the  constraints  must  be 
satisfied.  For  our  tests,  p  was  set  equal  to  10“®,  10“®,  and  10”'^.  The  minimal  value  of 
the  objective  function  /  subject  to  the  given  constraints  is  1186.382. 

Setting  the  penalty  parameter  p  =  lO”®  was  not  sufficient  for  our  purposes,  since 
the  minimum  of  the  penalty  function  was  quite  different  from  the  minimum  of  the  original 
function;  in  this  case,  the  final  value  of  /  was  729.2,  not  even  close  to  the  optimal  value. 
There  were  also  problems  with  p  =  10“^,  but  for  quite  different  reasons.  Recall  that 
the  convergence  criterion  for  the  algorithm  is  given  by  (7.2.1),  where  t  =  10“®.  Here, 
P'  fmin  =  pX  1186.382  =  1.2  X  10“'*,  so  that  only  two  digits  of  the  optimal  function 
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value  were  obtained. 

For  p  =  10“®,  the  final  computed  function  value  was  /  =  1190.384,  which  is 
close  to  the  optimal  value  of  the  constrained  function.  The  value  of  c^c,  the  square 
of  the  norm  of  the  constraint  violations,  was  approximately  10“^.  When  second  deriva¬ 
tives  were  available  to  compute  the  matrix/vector  products,  algorithm  TN  required  168 
function/gradient  evaluations  to  minimize  this  penalty  function.  Murtagh  and  Saunders 
[1980],  using  a  projected  Lagrangian  algorithm,  required  203  function/gradient  evalua¬ 
tions  to  obtain  a  solution  with  c^c  <  10~*^.  Although  these  two  results  are  not  directly 
comparable,  they  do  indicate  that  the  truncated-Newton  algorithm  is  effective  in  solving 
this  problem.  When  the  matrix/vector  products  were  computed  using  finite  differencing, 
algorithm  TN  required  1727  function/gradient  evaluations  to  minimize  F.  If  sparse  finite- 
differencing  had  been  used  to  approximate  the  Hessian,  423  function/gradient  evaluations 
would  have  been  used  (each  Hessian  can  be  computed  using  five  gradients). 

There  is  reason  to  assume  that  truncated-Newton  algorithms  will  in  general  perform 
well  on  penalty  functions.  Because  of  the  special  form  of  F,  the  Hessian  matrix  will  often 
have  two  clusters  of  eigenvalues.  The  first,  corresponding  to  the  objective  function  p  •  /, 
will  be  0{p))  the  second,  corresponding  to  the  penalty  term,  will  be  0(1).  The  Lanezos 
algorithm,  applied  to  the  soluton  of  the  Newton  equations,  works  well  if  the  matrix  has 
only  a  few  clusters  of  eigenvalues.  Also,  the  Lanezos  algorithm  is  able  to  quickly  and 
accurately  approximate  the  extreme  eigenvalues  of  a  matrix  (see  Parlett  [1980],  section 
12-5).  Hence,  if  a  truncated-Newton  algorithm  is  applied  to  a  penalty  function,  where  at 
each  stage  the  Newton  equations  involve  a  matrix  whose  eigenvalues  fall  into  two  clusters 
at  the  ends  of  the  spectrum,  good  performance  should  result. 
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8  Adapting  Truncated-Newton  Methods 
8.1.  Introducton 

When  Truncated-Newton  methods  were  presented  in  Chapter  3,  the  basic  algorithm 
was  deliberately  left  vague.  This  was  because  there  are  many  ways  in  which  such  an 
algorithm  can  be  implemented.  At  each  step,  a  choice  must  be  made  about  how  a  certain 
result  or  effect  is  to  be  achieved. 

Some  possible  choices  were  outlined,  or  at  least  mentioned,  in  the  succeeding  chap¬ 
ters.  In  Chapter  3,  algorithms  for  approximately  solving  the  Newton  equations  were 
developed.  In  Chapter  4,  we  described  ways  of  terminating  the  linear  algorithm.  And  in 
Chapter  5,  it  was  shown  how  the  method  could  be  preconditioned  using  other  linear  and 
non-linear  methods. 

When  designing  a  program  for  a  specific  computer,  or  when  choosing  a  method  to 
solve  a  specific  problem,  decisions  must  be  made  about  which  method  to  use  and  how  it 
will  be  implemented.  In  the  case  of  a  truncated-Newton  method,  many  rather  detailed 
options  have  to  be  selected  in  order  to  obtain  a  usable  algorithm. 

Often,  the  first  question  asked  is  which  algorithm  is  the  most  efficient  for  solving  the 
given  problem  or  a  wide  class  of  problems.  Answers  to  this  question  are  usually  based  on 
numerical  tests,  which  were  the  subject  of  Chapter  7.  But  this  is  not  the  only  criterion 
for  selecting  an  algorithm.  Another  important  question  is  which  method  is  most  stable. 
This  question  can  sometimes  be  answered  absolutely  on  the  basis  of  theoretical  results 
from  perturbation  theory. 

Many  other  questions  arise  because  of  more  practical  issues  such  as:  (a)  the  expense 
of  computing  the  function  being  minimized,  (b)  the  availability  of  second  derivatives,  (c) 
the  size  of  the  computer,  (d)  the  availability  of  routines  in  a  program  library,  (e)  the 
number  of  times  a  problem  is  to  be  solved,  etc. 

Until  recently,  it  was  generally  assumed  that  researchers  would  be  working  on  a  large 
central  computer,  and  that  professionally  written  software  would  be  available  on-line  in 
a  subroutine  library.  With  the  rise  of  the  small-computer  industry,  this  assumption  is 
now  often  false,  and  it  is  now  necessary  to  take  into  account  the  effect  of  small  machines 
when  designing  algorithms.  On  a  small  computer,  the  size  of  the  program  can  be  as 
important  a  consideration  as  the  size  of  the  problem.  This  is  not  just  because  storage 
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space  is  at  a  premium:  numerical  program  libraries  for  small  machines  are  still  rare,  and 
the  user  must  oRen  write  his  own  programs  or  manually  input  commercial  programs. 
Short  and  simple  algorithms  can  greatly  reduce  the  likelihood  of  error. 

In  addition,  small  computers  arc  often  owned  by  the  user,  and  are  generally  used  by  a 
small  group  of  people  only.  This  means  that  a  routine  considered  slow  in  a  large-machine 
environment  can  be  attractive  if  it  offers  a  substantial  reduction  in  storage  requirements. 
It  could  be  left  to  run  for  long  periods,  for  example  overnight,  with  little  inconvenience. 
This  can  greatly  influence  the  choice  of  an  algorithm;  the  optimal  routine  for  a  large 
machine  can  have  little  resemblance  to  the  optimal  routine  for  a  mini-computer. 

We  mentioned  above  several  questions  related  to  the  actual  problem  being  solved — 
the  difficulty  of  computing  the  function,  and  the  availability  of  second-derivative  infor¬ 
mation,  for  example.  Choosing  a  method  based  on  these  criteria  often  depends  on  the 
efficiency  of  the  method,  and  the  choice  must  be  made  on  the  basis  of  numerical  tests. 
Some  decisions,  however,  can  be  made  a  priori,  such  as  general  decisions  about  solving 
the  Newton  equations  and  about  how  matrix/vector  products  are  to  be  computed. 

In  order  to  simplify  the  process  of  choosing  a  specific  algorithm,  we  summarize  in 
section  8.2  the  possibilities  for  a  truncated-Newton  algorithm.  There  we  list  the  choices 
for  each  step  of  the  algorithm,  indicate  operation  and  storage  counts,  describe  possible 
interactions  with  other  modules  in  the  method,  and  mention  difficulties  that  might  be 
encountered  in  programming.  In  8.3,  a  couple  of  sample  situations  are  described,  along 
with  suggestions  about  how  to  put  together  a  truncated-Newton  algorithm  which  is  well- 
suited  to  the  needs  of  each  case. 


8.2.  Choices  for  sub-algorithms 

In  order  to  specify  a  truncated-Newton  algorithm,  five  sub-algorithms  must  be 
selected.  These  are: 

1.  The  algorithm  to  approximately  solve  the  Newton  equations  (2.2.3). 

2.  The  non-linear  outer  algorithm. 

3.  The  linear  preconditioning  strategy. 

4.  The  termination  criterion  and  forcing  sequence  for  the  linear  algorithm. 

5.  The  algorithm  for  computing  the  Hessian/vcctor  products. 

These  sub-algorithms  have  been  discussed  at  length  in  preceding  chapters,  but  mainly 
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from  a  theoretical  point  of  view.  In  this  section,  we  shall  take  up  issues  which  would 
arise  when  programming  and  using  truncated-Newton  methods. 

In  the  succeeding  sub-sections,  we  will  examine  each  of  these  choices  separately, 
indicating  operation  and  storage  counts.  The  notation  for  vectors  is  global,  so  that 
if  a  vector  name  appears  in  two  sections,  the  same  vector  is  being  referred  to  and  no 
additional  storage  is  required.  Generally,  all  choices  may  be  made  independently,  but 
usually  only  one  choice  may  be  made  from  each  section.  Exceptions  to  this  rule  will  be 
noted  as  they  occur. 

For  reference,  here  is  a  list  of  the  vectors  used  below: 

X  the  current  estimate  of  the  minimizing  point  F 
p  search  direction 

g  gradient 

Po  initial  search  direction,  used  in  Beale’s  method 
D  vector  which  represents  diagonal  preconditioning  matrix 
Di  temporary  value  of  D 

E  represents  sub-diagonal  of  preconditioning  matrix 

El  temporary  value  of  E 

а, -  scratch  vectors 

8.2.1.  Approximately  solving  the  Newton  equations 

We  shall  consider  six  ways  of  approximately  solving  the  Newton  equations  (2.2.3): 

1.  conjugate-gradient  (section  2.4) 

2.  preconditioned  conjugate-gradient  (Concus  et.  al  [1976]) 

3.  Lanezos  (sections  3. 3-3.5) 

4.  preconditioned  Lanezos  (sections  3.3-3.6,  3.8) 

5.  MINRES  (section  3.7) 

б.  preconditioned  MINRES  (sections  3.7-3.8) 

The  storage  requirements  and  operation  counts  for  these  methods  are  summarized  in 
Table  8.1  below.  Some  problems  that  might  be  encountered  when  using  these  methods 
are  also  mentioned  briefly  there.  For  the  reasons  given  in  section  3.2,  SOR-rclated 
methods  will  not  be  examined.  Clearly,  only  the  preconditioned  algorithms  can  be  used 
in  combination  with  a  non-linear  algorithm  or  a  linear  preconditioning  scheme. 
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The  conjugate-gradient  algorithm  is  the  simplest  algorithm  that  we  consider  feasible 
for  solving  the  Newton  equations.  Unfortunately,  the  conjugate-gradient  method  is  only 
designed  to  solve  systems  of  equations  with  positive-definite  matrices.  In  an  optimization 
setting  where  the  Hessian  matrix  in  the  Newton  equations  can  be  indefinite,  this  is  a 
serious  deficiency;  but  if  a  given  problem  is  known  to  have  a  positive-definite  Hessian 
everywhere,  this  may  not  matter.  When  the  Hessian  is  indefinite,  the  conjugate-gradient 
method  may  be  unstable.  The  addition  of  preconditioning  can  greatly  improve  the 
performance  of  this  method  at  little  computational  cost.  Therefore,  except  under  extreme 
circumstances,  a  preconditioned  conjugate-gradient  method  is  always  to  be  preferred  over 
the  regular  conjugate-gradient  method. 


ALGORITHM 

STORAGE 

OPERATIONS 

COMMENTS 

conjugate-gradient  (eg) 

P>  —  S3 

8n, 

matrix-vector  product 

only  for  positive- 
definite  systems 

preconditioned  eg 

P,  Si  -  84 

8n, 

matrix-vector  product, 
preconditioning  step 

only  for  positive- 
definite  systems 

Lanezos 

p,  «i  -  84 

12n, 

matrix-vector  product 

complex  to  program 

preconditioned  Lanezos 

Pi  Si  —  «5 

12n, 

matrix-vector  product, 
preconditioning  step 

complex  to  program 

MINRES 

Pi  Si  —  S5 

16n, 

matrix- vector  product 

complex  to  program 

preconditioned  MINRES 

Pi  Si  —  S6 

16n, 

matrix-vector  product, 
preconditioning  step 

complex  to  program 

Table  8.1  Choices  for  the  linear  algorithm 

In  order  to  be  able  to  treat  indefinite  systems  of  equations,  it  is  possible  to  use 
a  method  based  on  the  Lanezos  algorithm  for  tridiagonalizing  a  symmetric  matrix. 
It  is  slightly  more  expensive  to  use  than  the  conjugate-gradient  method;  it  is  also 
more  complex  to  program  since  it  involves  three  separate  sub-algorithms:  the  Lanezos 
tridiagonalization,  the  modified-Cholesky  factorization,  and  the  conjugate-gradient  step. 
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In  a  general  setting,  though,  it  is  a  stable  and  predictable  way  of  handling  indefinite 
Hessian  matrices,  which  is  not  true  of  the  conjugate-gradient  algorithm.  Again,  it  is  easy 
to  add  a  preconditioning  step. 

The  MINRES  algorithm  is  a  variant  of  the  Lanczos  algorithm  which  guarantees  that 
the  norm  of  the  residual  decreases  at  each  iteration.  It  is  based  on  a  QR  factorization 
of  the  tridiagonal  matrix  resulting  from  the  Lanczos  process.  Programming  this  method 
is  comparable  to  programming  the  Lanczos  method  above,  and  it  is  equally  easy  to 
precondition  the  algorithm. 

8.2.2.  Non-linear  algorithms 

We  shall  look  at  five  non-linear  outer  algorithms: 

1.  linesearch  (section  1.4) 

2.  non-linear  conjugate-gradient  (section  2.4) 

3.  Beale’s  method  (Gill  and  Murray  [1979]) 

4.  limited-memory  quasi-Newton  (section  2.5,  Gill  and  Murray  [1979]) 

5.  quasi-Newton  (section  2.3) 

Describing  a  linesearch  as  a  non-linear  outer  algorithm  may  be  something  of  an  over¬ 
statement.  We  are  referring  to  the  following  method:  1)  approximately  solve  the  Newton 
equations  at  the  current  point  to  compute  a  search  direction,  2)  use  this  search  direction 
in  the  linesearch  to  compute  a  new  point. 


ALGORITHM 

STORAGE 

OPERATIONS 

COMMENTS 

linesearch 

P,  9,  X,  8l 

>  3n, 

difficult  to  program 

non-linear  eg 

P,  9 

3n  —  4n, 

requires  a  linesearch 

Beale’s  method 

Pi  PO)  9 

8n  —  lOn, 

requires  a  linesearch 

limi  ted-memory 
quasi-Newton  method 
{k  updates) 

Pi  ffj  —  «2fc 

k{k+l)/2  X 
(one  update) 

requires  a  linesearch, 
may  be  preconditioned 

Table  8.2  Choices  for  the  non-linear  algorithm 

With  the  exception  of  the  line-search,  all  of  these  algorithms  are  used  to  generate 
a  preconditioning  for  the  linear  algorithm,  i.c.  the  formulas  for  the  outer  algorithm 
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implicitly  describe  some  linear  operator  which  can  then  be  applied  to  any  vector.  Thus, 
these  algorithms  can  be  used  only  with  an  algorithm  which  can  be  preconditioned.  Trust- 
region  methods  could  also  be  used  as  non-linear  algorithms  for  a  truncated-Newton 
code,  but  they  will  not  be  discussed  here.  A  summary  of  operation  counts  and  storage 
requirements  can  be  found  in  Table  8.2  above. 

A  linesearch  is  the  simplest  algorithm  that  is  quaranteed  to  converge  that  could  be 
used  for  the  non-linear  outer  iteration  in  a  truncated-Newton  code;  also,  a  linesearch  will 
be  a  part  of  all  the  other  algorithms  to  be  described  in  this  sub-section.  Thus,  such  an 
algorithm  would  be  central  to  any  program  using  a  linesearch  stategy.  The  operation 
count  is  difficult  to  estimate,  since  it  will  depend  on  how  many  guesses  are  needed  to 
“sufficiently  decrease”  the  value  of  the  objective  function.  If  k  guesses  are  used,  then 
{2k  -1-  l)n  operations  and  k  function-gradient  evaluations  will  be  required.  For  many 
problems,  k  will  be  equal  to  1  as  the  minimum  is  approached.  Our  numerical  tests  have 
indicated  that  truncated-Newton  methods  compute  well-scaled  search  directions,  and 
that  k  is  often  equal  to  1  when  the  Hessian  matrix  is  positive-definite.  An  efficient  linear 
search  can  be  difficult  to  program,  but  sample  programs  are  often  found  in  program 
libraries  and  even  on  pocket  calculators. 

The  simplest  way  to  generate  a  preconditioning  for  the  linear  algorithm  is  to  use 
a  non-linear  conjugate-gradient  algorithm.  Beale’s  method  is  a  variant  of  a  non-linear 
conjugate-gradient  algorithm  in  which  the  new  search  direction  is  computed  using  both 
the  most  recent  direction  as  well  as  the  first  direction. 

Limited-memory  quasi-Newton  algorithms  are  almost  as  flexible  as  truncated- 
Newton  methods,  because  it  is  possible  to  choose  both  the  type  of  quasi-Newton  update 
to  use  as  well  as  the  total  number  of  updates.  There  is  some  evidence  to  indicate  (see 
Fenelon  [1981])  that  choosing  k  bigger  than  2  is  not  economical.  It  should  be  noted  that 
a  diagonal  preconditioning  can  be  added  to  this  algorithm  (section  5.3). 

It  is  also  possible  to  consider  using  a  quasi-Newton  method  to  precondition  the  linear 
algorithm.  Unlike  all  the  other  methods  considered  in  this  section,  quasi-Newton  methods 
have  storage  and  operation  counts  which  are  quadratic,  not  linear,  in  n.  For  this  reason, 
it  is  difiicult  to  imagine  them  being  competitive  with  the  other  methods  proposed  here. 
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8.2.3.  Linear  preconditionings 

Most  of  the  preconditionings  generated  during  the  linear  subiteration  were  discussed 
in  detail  in  Chapter  5.  However,  a  few  of  them  were  only  alluded  to  in  passing.  In  this 
sub-section  we  will  consider  the  following  options: 

1.  BFGS  diagonal  preconditioning  (section  5.3) 

2.  rank-one  diagonal  preconditioning  (section  5.3) 

3.  exact  BFGS  diagonal  preconditioning  (sections  5.3  and  5.4) 

4.  exact  diagonal  of  the  Hessian 

5.  tridiagonal  preconditioning  based  on  (section  5.5) 

6.  tridiagonal  preconditioning  based  on  R~^TR'~^  (section  5.5) 

7.  product  of  the  tridiagonal  preconditionings  (section  5.6) 

8.  exact  BFGS  tridiagonal  preconditioning 

9.  exact  BFGS  tridiagonal  factors  preconditioning 

As  in  the  previous  section,  these  options  can  only  be  used  with  a  preconditioned  linear 
algorithm.  Their  operation  counts  and  storage  requirements  are  summarized  in  Table 
8.3  below. 

Because  a  new  preconditioning  is  being  developed  while  the  old  one  is  still  in  use, 
two  copies  of  the  operator  must  be  kept  when  using  all  but  the  fourth  preconditioning 
algorithm.  Since  the  rank-one  formula  does  not  guarantee  positive-definiteness  for  the 
preconditioning,  some  strategy  must  be  designed  to  modify  the  diagonal  when  negative 
elements  arise.  The  exact  BFGS  formula  requires  a  separate  initialization  step  if  a  non¬ 
linear  preconditioning  is  also  being  used  (see  section  5.3). 

The  two  tridiagonal  preconditionings  (options  5  and  6  above)  are  very  similar.  They 
have  the  same  storage  requirements,  and  the  programs  which  implement  them  have  only 
minor  differences.  When  using  option  6,  however,  it  is  easy  to  re-orthogonalize  the  new 
Lanczos  vector  using  the  projection  matrix  R.  Loss  of  orthogonality  can  seriously  degrade 
the  performance  of  conjugate-gradient  and  Lanczos  algorithms;  re-orthogonalization  can 
significantly  improve  stability  and  convergence. 

■  It  is  only  possible  to  precondition  using  the  diagonal  elements  of  the  Hessian  if  these 
elements  can  be  computed  at  little  cost.  If  this  is  feasible,  then  this  is  a  simple  and 
inexpensive  preconditioning  to  use.  If  the  Hessian  is  not  positive  definite  everywhere, 
some  strategy  must  be  devised  to  handle  negative  diagonal  elements.  They  might  be  set 
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to  some  small  positive  value,  or  their  absolute  value  might  be  used;  it  is  also  possible 
to  use  the  negative  diagonals  to  compute  directions  of  negative  curvature.  This  latter 
option  could  be  used  in  place  of  the  inner  algorithm  at  the  current  iteration.  A  direction 
of  negative  curvature  could  be  computed  immediately  from  the  exact  Hessian  information 
and  used  in  the  line  search  with  little  cost. 


ALGORITHM 

STORAGE 

OPERATIONS 

COMMENTS 

BFGS  diagonal 

D,  Di,  fli 

9n  to  update, 
n  to  apply 

rank-one  diagonal 

D,  Di, 

6n  to  update, 
n  to  apply 

may  be  indefinite 

exact  BFGS  diagonal 

D,  Dt 

8n  to  update, 
n  to  apply 

requires  initialization 

diagonal  of  Hessian 

D 

n  to  apply 

may  be  costly  to  obtain, 
may  be  indefinite 

tridiagonal  VTV"^ 

{k  projectors) 

3i  —  safc+i 

(4A:  -f  l)n  to  apply, 
(fc^  -f  k)n  to  form 

requires  long  program 

tridiagonal  R~'^TR~^ 

(fc  projectors) 

—  «2fc+l 

{Ak  -H  l)n  to  apply, 
[k^  -f  k)n  to  form 

requires  long  program 

tridiagonal  product 
{k  projectors) 

—  »2fc-|-l 

(4A:  -1-  l)n  to  apply, 
(k^  +  k)n  to  form 

requires  long  program 

BFGS  exact  tridiagonal 

D,  Di,  E,  Ei 

6n  to  apply, 

14n  to  update 

may  be  indefinite 

Table  8.3  Possible  preconditioning  strategies 

The  next  three  preconditionings  are  considerably  more  complex  and  expensive  to 
apply  than  any  of  the  other  preconditioning  strategies  examined  here.  This  is  because 
they  involve  the  projected  Hessians,  and  information  about  the  projection  matrices  must 
be  computed  and  stored.  There  is  also  some  choice  about  how  much  information  will  be 
used;  we  assume  here  that  k  Lanezos  vectors  are  needed,  and  that  k  is  small  in  comparison 
with  n.  Since  these  preconditionings  use  information  from  the  Lanezos  algorithm,  they 
cannot  be  used  with  the  regular  linear  conjugate-gradient  method. 

It  is  possible  to  use  the  correspondence  between  the  quasi-Newton  and  linear 
conjugate-gradient  algorithms  to  generate  not  just  the  diagonal,  but  also  the  principle 
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subdiagonal  of  Hessian  matrix.  This  method  is  almost  as  easy  to  program  as  the  diagonal 
preconditionings  above;  however,  there  is  a  problem  with  positive-definiteness.  Although 
the  diagonal  and  the  complete  Hessian  approximation  can  be  guaranteed  to  be  positive 
definite,  the  tridiagonal  submatrix  may  be  indefinite,  and  some  strategy  must  be  derived 
for  modifying  it  in  this  case. 

Because  of  the  problem  with  indefiniteness  for  the  preceding  method,  it  would  be 
preferable  to  update  the  diagonal  and  subdiagonal  of  the  Cholesky  factor  of  the  ap¬ 
proximate  Hessian.  This  would  guarantee  a  positive-definite  preconditioning.  Unfor¬ 
tunately,  this  is  infeasible.  An  examination  of  the  formulas  for  updating  matrix  fac¬ 
torizations  in  Gill,  et  al.  [1974]  shows  that  updating  a  portion  of  the  factorization  other 
than  the  diagonal  requires  knowledge  of  the  complete  factorization  of  the  old  precon¬ 
ditioning.  Even  if  this  information  were  available,  accessing  it  would  be  an  O(n^)  process. 
Since  the  preconditioning  is  being  updated  at  every  linear  sub-iteration,  this  method  is 
uneconomical.  Because  generally  very  few  linear  iterations  are  performed,  even  making 
an  update  of  this  type  once  per  outer  iteration  would  often  be  impractical. 


8.2.4.  Termination  criteria  for  the  linear  algorithm 

In  Chapter  4,  we  considered  the  following  convergence  criteria  for  the  linear  algo¬ 
rithm: 

1.  llr.ll/IIgWlI 

2. 

3-  ie,+i-«,iVll9<‘>ll 

4-  ie.+i-«,i*/ie,+il* 

All  of  these  formulas  require  2n-3n  operations  to  compute.  In  some  cases,  for  example 
criterion  1  in  combination  with  a  MINRES  algorithm,  they  are  a  natural  by-product  of 
the  algorithm.  When  the  Lanczos  algorithm  is  used  for  the  linear  sub-iteration  it  may  be 
necessary  to  compute  and  store  the  residual  in  order  to  apply  these  tests.  They  are  all 
easy  to  program.  The  choice  of  a  forcing  sequence  (see  section  4.4)  can  be  made  solely 
on  the  basis  of  numerical  tests. 
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8.2.5.  Computing  matrix/vector  products 

There  are  three  principle  methods  of  obtaining  the  matrix/vector  products  Gp 
required  during  the  linear  sub-iteration.  They  are: 

1.  finite  differencing  along  p 

2.  computing  G  using  (sparse)  finite-differencing  (section  2.5) 

3.  computing  G 

Which  method  is  used  depends  on  the  function  being  minimized;  it  does,  however,  have 
an  important  bearing  on  the  remainder  of  the  algorithm.  If  the  Hessian  is  difficult  to 
compute,  or  is  large  and  dense,  then  finite  differencing  along  p  may  be  the  only  option 
available.  This  discourages  the  use  of  a  large  number  of  linear-subiterations  since  an 
additional  gradient  evaluation  is  required  for  each  matrix/vector  product.  For  problems 
where  the  function  and  its  first  and  second  derivatives  are  inexpensive  to  compute  relative 
to  the  cost  of  solving  the  linear  system,  we  would  again  perform  few  linear  iterations,  as 
the  cost  of  the  linear  sub- algorithm  would  dominate  the  cost  of  the  function  and  gradient 
evaluations. 

When  G  is  available  and  the  function  is  moderately  expensive  to  compute,  a  larger 
number  of  inner  iterations  would  be  encouraged.  In  this  case,  the  cost  of  computing  the 
function  and  its  derivatives  dominates  the  cost  of  the  linear  sub-iteration  and  is  the  same 
at  every  non-linear  outer  iteration. 

Unfortunately,  few  absolute  statements  can  be  made  about  choosing  this  segment  of 
the  truncated-Ncwton  algorithm;  a  decision  should  be  made  in  the  context  of  a  specific 
problem  or  class  of  problems. 


8.3.  Choosing  a  complete  truncated-Newton  algorithm 

In  this  section,  we  will  describe  what  we  consider  to  be  the  two  extreme  versions 
of  a  truncated-Ncwton  algorithm.  The  decisions  made  about  the  construction  of  the 
complete  algorithm  are  based  mainly  on  the  size  of  the  machine  being  used — either  very 
small  or  large.  The  ideas  used  to  describe  each  of  these  situations  can  be  easily  applied 
to  more  specialized  cases. 

In  the  introduction  to  this  chapter,  we  mentioned  a  number  of  issues  which  might 
affect  the  choice  of  a  particular  algorithm.  Some  of  these  involved  the  function  being 
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minimized.  Although  a  truncated-Newton  algorithm  can  be  used  for  general  optimiza¬ 
tion,  we  consider  that  it  will  be  most  useful  for  large-scale  minimization  problems. 

In  this  context,  we  take  “large-scale”  to  mean  that  it  is  difficult  to  use  second- 
derivative  information.  This  might  be  because  the  dimension  of  the  problem  is  large, 
in  which  case  storing  or  factoring  the  second-derivative  matrix  is  impossible.  It  might 
also  mean  that  the  Hessian  matrix  is  expensive  to  compute  or  is  unavailable.  A  further 
possibility  is  that  the  Hessian  matrix  may  be  expressed  as  the  product  of  several  large 
sparse  matrices,  and  it  is  uneconomical  to  obtain  its  elements  explicitly  (this  is  the  case 
in  large  constrained  optimization  problems  (see  Chapter  6)). 


8.3.1  The  large-machine  case 

When  working  on  a  large  machine,  the  only  important  consideration  is  efficiently 
finding  the  solution  to  the  problem  in  a  stable  manner.  All  the  necessary  algorithms  are 
assumed  to  be  professionally  coded  and  available  in  a  program  library,  and  the  size  of  the 
program  is  not  an  issue  (since  it  is  pre-compiled  in  an  object-code  library).  The  length 
and  complexity  of  the  algorithms  are  not  factors  in  any  decision.  However,  the  storage 
requirements  for  the  method  (the  number  of  vectors  required)  are  still  important. 

Thus,  a  preconditioned  Lanczos  algorithm  should  be  selected  to  approximately  solve 
the  Newton  equations.  Alternatively,  a  preconditioned  MINRES  algorithm  might  be 
chosen  if  it  could  be  shown  to  be  more  effective  for  the  class  of  problems  being  solved; 
when  terminated  using  ||r,l|,  MINRES  is  almost  as  efficient  as  a  Lanczos  method.  As 
a  non-linear  algorithm,  we  would  probably  select  a  two-step  diagonally-preconditioned 
limited-memory  quasi-Newton  method.  Such  an  algorithm  has  been  shown  to  be  efficient 
and  cost-effective  for  large  optimization  problems  (see  Gill  and  Murray  [1979],  Fenelon 
[1981]),  and  has  performed  well  in  our  numerical  tests  here. 

We  would  choose  one  of  the  diagonal  preconditioning  schemes  to  precondition  the 
linear  algorithm.  They  all  have  low  storage  requirements  and  are  inexpensive  to  generate. 
The  tridiagonal  preconditioning  schemes  are  considerably  more  expensive  to  use  and  less 
successful  in  practise;  they  would  have  to  perform  much  better  in  numerical  tests  before 
they  could  be  recommended  for  general  use.  Our  results  indicate  that  the  exact  BFGS 
diagonal  preconditioning  is  the  most  effective  of  the  diagonal  schemes.  This  choice  is 
based  on  numerical  tests,  storage  requirements,  and  the  stronger  theoretical  justifications 
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for  this  scheme.  Since  we  are  unconcerned  about  the  length  of  the  program,  it  would 
be  possible  to  include  preconditioning  with  the  diagonal  elements  of  the  Hessian  as  a 
user-specified  option  when  these  elements  can  be  computed  easily. 

The  other  options  for  the  algorithm  (the  termination  criterion  and  the  forcing 
sequence)  would  be  chosen  on  the  basis  of  numerical  tests.  It  would  depend  somewhat  on 
the  other  choices  made  for  the  algorithm.  The  method  used  to  compute  matrix/vector 
products  could  be  chosen  by  the  user  of  the  code  at  run-time. 


8.3.2  The  small-machine  case 

The  major  difference  between  the  small-  and  the  large-machine  cases  is  that  the  size 
of  the  program  is  now  a  factor  in  the  choice  of  the  algorithm.  It  is  impossible  to  store  a 
large  code  in  the  memory  of  a  small  machine.  For  this  reason,  simple  iterative  methods 
are  often  preferable  to  direct  methods  for  solving  many  problems.  Also,  because  the  user 
is  often  not  paying  for  computer  time,  and  time-sharing  is  not  in  effect,  storage  can  be 
a  more  important  issue  than  speed  in  the  choice  of  an  algorithm. 

Another  reason  to  favor  simple  and  short  algorithms  is  that  a  small  computer  will 
not  usually  come  complete  with  a  program  library.  The  user  must  either  write  his  own 
programs,  or  at  least  may  be  obliged  to  input  the  program  by  typing.  In  order  to  decrease 
the  probability  of  error,  and  also  to  reduce  the  overall  time  needed  to  solve  a  problem, 
easy-to-program  methods  are  preferred. 

For  these  reasons,  a  preconditioned  conjugate-gradient  algorithm  could  be  chosen  to 
approximately  solve  the  Newton  equations.  In  extreme  cases,  the  preconditioning  step 
could  be  omitted;  it  is,  however,  a  simple  addition  to  the  program  and  it  can  greatly 
speed  convergence.  A  linesearch  could  be  chosen  as  the  non-linear  algorithm.  Because 
of  possible  indefiniteness,  the  search  direction  p  should  be  monitored  at  every  linear 
iteration  to  insure  that  it  is  a  descent  direction.  Although  a  simple  non-linear  conjugate- 
gradient  method  is  easy  to  add,  effective  methods  of  this  type  include  such  features  as 
restarting  strategies  which  can  increase  the  complexity  of  the  code. 

As  in  the  large-machine  case,  a  diagonal  preconditioning  should  be  used,  unless 
storage  is  at  such  a  premium  that  no  preconditioning  can  be  included.  The  limitations 
of  the  small  machine  do  not  seriously  affect  the  choice  of  the  termination  criterion  and 
the  forcing  sequence.  The  matrix/vector  products  would  probably  be  computed  by  finite 


99 


differencing  along  p,  for  reasons  of  simplicity. 

The  analysis  of  these  two  special  cases  gives  some  indication  of  how  a  truncated- 
Newton  algorithm  can  be  adapted  to  a  specific  computing  environment.  Final  decisions 
about  preconditioning  and  termination  rules  must  be  made  on  the  basis  of  numerical 
tests.  Some  recommendations  were  made  on  the  basis  of  the  results  in  Chapter  7.  When 
solving  specific  classes  problems  in  special  environments,  though,  some  of  the  detailed 
choices  might  be  made  differently. 
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Appendix 


The  following  tables  summarize  the  results  of  the  tests  discussed  in  section  7.7.  Chapters  7  and  8  may 
be  useful  in  interpreting  the  tables  given  here. 

Table  1 — Comparison  of  preconditioning  strategies  using  a  subset  of  the  test  functions. 


Preconditioning 

Pent 

GenR« 

GenR 

Cnlls 

Call 

Cheb 

Rank-2  D 
Rank-1  D 
BFGS  D 

Exact  D 

vtV^ 

r-Ttr-1 

Product  T 
BFGS  T 

7  32 

7  32 

7  32 

5  24 

DNC 

8  40 

10  57 

10  174 

33  183 

34  192 
34  188 
39  224 

DNC 

DNC 

DNC 

41  223 

33  315 

34  338 

34  316 

39  355 

DNC 

DNC 

DNC 

41  321 

12  94 

13  101 

12  93 

11  78 

DNC 

DNC 

DNC 

12  93 

12  191 

13  236 
12  194 

11  63 

DNC 

DNC 

DNC 

12  181 

8  83 

8  84 

9  91 

8  77 

17  232 
21  230 

DNC 

14  181 

Table  2 — Comparison  of  forcing  sequences  {ipk}  using  a  subset  of  the  test  functions.  <f>k  =  min 
and  the  forcing  function  (4.3.6)  is  being  used. 


Forcing  Sequence 

Penl 

GenRs 

GenR 

Calls 

Call 

,  Cheb 

V>fc  =  .5 

8 

42 

39 

204 

39 

247 

29 

213 

29 

164 

9 

56 

=  .1 

8 

44 

32 

175 

32 

297 

8 

59 

8 

164 

7 

83 

==  .05 

8 

44 

35 

207 

35 

427 

7 

51 

7 

172 

6 

79 

^k  —  .01 

8 

45 

33 

197 

33 

149 

5 

38 

5 

140 

10 

152 

il^k  =  max{^*,.5} 

■■ 

41 

223 

41 

275 

31 

226 

31 

169 

9 

56 

ipk  =  max{^fc,.l} 

■i 

34 

188 

34 

301 

13 

99 

13 

180 

9 

79 

35 

191 

35 

317 

12 

93 

12 

173 

9 

90 

34 

188 

34 

316 

12 

93 

12 

194 

9 

101 

MAXIT=5 

7 

32 

35 

181 

35 

235 

DNC 

DNC 

11 

69 

MAXIT=10 

7 

32 

35 

194 

35 

316 

19 

141 

19 

189 

9 

79 

MAXIT=15 

7 

32 

34 

188 

34 

316 

15 

113 

15 

187 

9 

92 

MAXIT=20 

7 

32 

34 

188 

34 

316 

14 

106 

14 

201 

9 

91 

101 


Table  3— Comparbon  of  forcing  sequences  {V**;}  using  asubset  of  the  test  functions.  =  min  {l/A:, 
and  the  forcing  function  (4.3.5)  b  being  used. 


« 


Forcing  Sequence 

.  Penl 

GenRs 

GenR 

Calls 

Call 

Cheb 

lf>k  =  1.00* 

7 

32 

37 

207 

37 

391 

10 

77 

10 

177 

11 

130 

Ipk  =  1.50* 

7 

32 

35 

183 

35 

342 

10 

77 

10 

163 

8 

80 

01*  =  2.00* 

7 

32 

34 

190 

34 

344 

11 

88 

11 

181 

10 

96 

0*  =  2.50* 

7 

32 

38 

210 

38 

402 

11 

84 

11 

168 

8 

59 

0*  =  3.00* 

7 

32 

38 

205 

38 

347 

13 

102 

13 

164 

10 

90 

Tables  4 — Comparbon  of  a  number  of  truncated-Newton  routines  with  various  diagonal  preconditionings. 
The  full  set  of  test  functions  b  used  with  ij  =  .25. 

Table  4A — Smaller  functions.  Differencing  along  search  direction. 


Function 

PC 

TNI 

TN2 

TN3 

MINR 

Penl 

1 

7 

29 

7 

32 

7 

32 

7 

38 

start  3 

2 

7 

29 

7 

32 

7 

32 

7 

38 

n  =  50 

3 

7 

29 

7 

32 

7 

32 

7 

39 

Pen2 

1 

11 

55 

10 

56 

10 

57 

16 

93 

Start  3 

2 

10 

49 

10 

56 

10 

56 

17 

100 

n  =  50 

3 

11 

64 

10 

58 

10 

59 

17 

102 

Pen3 

1 

47 

9 

10 

53 

Start  3 

2 

54 

48 

9 

45 

11 

67 

n  =  50 

3 

47 

47 

45 

11 

66 

GenR 

1 

31 

370 

34 

383 

33 

315 

37 

245 

Start  2 

2 

32 

417 

36 

519 

34 

338 

38 

435 

n  =  50 

3 

31 

330 

35 

342 

34 

316 

36 

329 

Call 

1 

9 

176 

10 

175 

12 

191 

11 

182 

Start  1 

2 

10 

273 

10 

191 

13 

236 

11 

235 

n  =  50 

3 

10 

199 

10 

163 

12 

194 

12 

200 

Cal2 

1 

7 

6 

58 

9 

11 

103 

Start  1 

2 

8 

87 

8 

79 

84 

11 

141 

n  =  50 

3 

7 

69 

7 

66 

10 

86 

9 

80 

Cal3 

1 

7 

112 

9 

10 

122 

8 

97 

SUrt  1 

2 

•7 

107 

11 

96 

11 

127 

9 

124 

n  =  50 

3 

7 

112 

11 

101 

11 

114 

9 

123 

102 


Table  4B — Larger  functions.  Differencing  along  search  direction. 


Function 


Pen2 
Start  3 

n=  100 


Pens 
Start  3 
n=  100 


GenR 
Start  2 

n=  100 


Call 
Start  1 
n=  100 


Cal2 
Start  1 

n=  100 


Cals 
Start  1 
n=  100 


6  33 

6  32 

6  29 


11  68 
11  115 
11  65 


61  764 
63  1090 
57  684 


11  413 
10  511 
10  409 


7  154 
9  219 

8  117 


7  154 
7  171 
7  159 


TN2  1 

TNS  1 

MINR 

2 

11 

2 

11 

2 

12 

2 

11 

2 

11 

2 

12 

2 

11 

2 

11 

2 

12 

5 

24 

5 

27 

6 

36 

5 

24 

5 

27 

6 

36 

5 

25 

5 

26 

6 

42 

10 

61 

10 

59 

11 

69 

11 

99 

10 

59 

11 

64 

10 

58 

10 

64 

11 

62 

63 

754 

61 

683 

65 

610 

64 

1046 

68 

979 

65 

653 

62 

796 

64 

672 

67 

1183 

10 

302 

14 

335 

12 

351 

12 

521 

18 

483 

14 

464 

11 

332 

15 

375 

16 

1228 

7 

115 

11 

141 

12 

166 

7 

142 

13 

140 

15 

318 

6 

107 

11 

131 

10 

407 

13 

194 

13 

199 

10 

237 

11 

220 

14 

232 

11 

259 

16 

229 

14 

207 

11 

498 

Table  4C — Miscellaneous  smaller  functions.  Differencing  along  search  directions. 


Function 

PC 

Cheb 

1 

Start  2 

2 

n  =  20 

3 

QOR 

1 

Start  1 

2 

n  =  50 

3 

GOR 

1 

Start  1 

2 

n  =  50 

3 

ChaR 

1 

Start  4 

2 

n  =  25 

3 

8 

74 

8 

72 

7 

53 

6  25 


TN2 

1  TNS 

8 

83 

8 

83 

8 

89 

8 

84 

8 

80 

9 

91 

6 

25 

6 

24 

6 

24 

6 

25 

6 

24 

6 

24 

8 

65 

8 

60 

8 

87 

8 

68 

8 

65 

8 

57 

14 

96 

15 

106 

15 

133 

15 

154 

12 

77 

12 

77 

MINR 


10  79 

10  94 

10  81 


9  77 

9  80 

9  111 


16  117 

15  122 

16  151 


Table  4D — Sparse  finite-difierencing. 


Function 

PC 

TNI 

TN2 

TN3 

MINR 

GenRs 

1 

31 

179' 

34 

193 

33 

183 

37 

187 

Sturt  2 

2 

32 

190 

36 

204 

34 

192 

38 

195 

n  =  50 

3 

31 

168 

35 

183 

34 

188 

36 

184 

GenRs 

1 

61 

357 

63 

356 

61 

348 

65 

339 

Sturt  2 

2 

63 

367 

64 

376 

68 

409 

65 

326 

n=  100 

3 

57 

335 

62 

348 

64 

379 

67 

356 

Cullu 

1 

fl 

71 

10 

77 

12 

94 

11 

87 

Sturt  1 

2 

78 

10 

77 

13 

101 

11 

89 

n  =  50 

3 

78 

10 

77 

12 

93 

12 

94 

Culls 

1 

11 

85 

10 

81 

14 

109 

12 

98 

Sturt  1 

2 

10 

78 

12 

95 

18 

139 

14 

109 

n=  100 

3 

10 

79 

11 

89 

15 

120 

16 

125 

Cul2s 

1 

7 

EH 

6 

43 

9 

64 

11 

71 

Sturt  1 

2 

8 

8 

57 

10 

71 

11 

81 

n  =  50 

3 

7 

7 

50 

10 

71 

9 

68 

Cul2s 

1 

7 

7 

51 

78 

12 

88 

Sturt  1 

2 

9 

64 

7 

50 

92 

15 

111 

3 

II 

o 

o 

3 

8 

57 

6 

44 

78 

10 

73 

CuUs 

1 

EM 

9 

64 

71 

8 

60 

Sturt  1 

2 

9 

64 

85 

9 

67 

n  =  50 

3 

HH 

50 

11 

78 

^^B9 

78 

9 

69 

CuI3s 

1 

■B 

Bi 

13 

92 

■IBS 

Sturt  1 

2 

11 

78 

n==  100 

3 

16 

113 

14 

99 

QORs 

1 

6 

55 

6 

55 

6 

55 

BR 

Sturt  1 

2 

6 

55 

6 

55 

6 

55 

n  =  50 

3 

6 

55 

6 

55 

6 

55 

I^B 

GORs 

1 

7 

64 

KB 

mm 

9 

84 

Sturt  1 

2 

7 

64 

wSM 

9 

85 

n  =  50 

3 

7 

65 

8 

73 

MEM 

9 

83 

ChuRs 

1 

74 

■El 

74 

77 

HQ 

El 

Sturt  4 

2 

77 

79 

81 

19 

n  =  25 

3 

56 

12 

58 

12 

58 

16 

El 

Table  4E — Totals.  To  compute  the  sparse  totals,  non-sparse  results  were  used  whenever  a  sparse  result  was 
not  avmlable. 


Function 

PC 

TNI 

TN2 

TN3 

MINR 

Totab 

1 

222  1404 

232  1473 

243  1553 

260  1613 

sparse 

2 

227  1492 

239  1567 

262  1712 

269  1695 

3 

215  1341 

236  1479 

250  1620 

266  1683 

Totals 

1 

.222  2709 

232  2580 

243  2553 

260  2597 

regular 

2 

227  3477 

239  3417 

262  3180 

269  3276 

3 

215  2539 

236  2613 

250  2581 

266  4752 

104 


Tables  5— Comparison  of  various  truncated-Newton  routines  against  other  optimization  algorithms.  All 
test  functions  are  used  with  i;  =  .25,  .1,  .001. 

Table  6A — Smaller  functions.  Differencing  along  search  direction. 


Function 

n 

PLMA 

QNM 

MNA 

TN 

PBTN 

BTN 

Penl 

.25 

22 

53 

27 

33 

17 

18 

7 

29 

10 

46 

46 

Start  3 

.1 

8 

27 

8 

26 

9 

25 

7 

30 

10 

48 

48 

n  =  50 

.001 

8 

32 

8 

31 

7 

29 

3 

19 

6 

40 

6 

40 

PenZ 

.25 

52 

118 

134 

242 

17 

17 

■0 

64 

10 

60 

14 

89 

Start  3 

.1 

28 

76 

99 

322 

9 

31 

78 

10 

64 

13 

86 

n  =  50 

.001 

15 

71 

73 

341 

6 

26 

■9 

95 

9 

67 

12 

118 

Pen3 

.25 

40 

67 

135 

44 

47 

11 

60 

10 

62 

Start  3 

.1 

38 

63 

150 

12 

44 

9 

46 

9 

52 

10 

61 

n  =  50 

.001 

28 

71 

56 

155 

11 

48 

9 

54 

10 

63 

10 

72 

GenR 

.25 

108 

201 

128 

287 

62 

202 

31 

330 

42 

584 

33 

499 

Start  2 

.1 

119 

263 

118 

323 

66 

257 

33 

348 

36 

469 

32 

518 

n  =  50 

.001 

119 

330 

118 

412 

88 

392 

34 

395 

37 

468 

35 

614 

Call 

.25 

194 

366 

-  162 

191 

7 

9 

10 

199 

8 

227 

16 

978 

Start  1 

.1 

204 

401 

89 

214 

6 

11 

9 

158 

8 

232 

17 

1151 

n  =  50 

.001 

205 

456 

88 

269 

6 

17 

9 

166 

7 

252 

13 

787 

Cal2 

.25 

64 

106 

28 

52 

■1 

4 

7 

69 

7 

104 

mm 

Start  1 

.1 

61 

118 

28 

54 

Bi 

4 

7 

7 

104 

n  =  50 

.001 

60 

123 

28 

67 

hh 

6 

7 

71 

7 

107 

IKI 

Cal3 

.25 

80 

152 

90 

114 

6 

6 

7 

112 

■D 

125 

Start  1 

.1 

78 

155 

59 

135 

5 

7 

7 

112 

127 

7 

n  =  50 

.001 

77 

161 

59 

161 

5 

11 

7 

113 

HH 

125 

8 

289 

105 


Table  5B — Larger  functions.  Differencing  along  search  direction. 


Function 

MM 

PLMA 

QNM 

MNA 

TN 

PBTN 

BTN 

Penl 

.25 

17 

NR 

NR 

2 

11 

2 

11 

Start  3 

.1 

2 

9 

NR 

NR 

2 

11 

2 

11 

O 

O 

II 

.001 

•  2 

NR 

NR 

2 

12 

2 

12 

Pen2 

■a 

28 

NR 

NR 

6 

29 

6 

26 

6  25 

Start  3 

18 

NR 

NR 

6 

30 

.5 

23 

5  24 

n  =  100 

29 

NR 

NR 

6 

41 

5 

33 

5  36 

Pens 

49 

85 

NR 

NR 

11 

65 

11 

75 

13  79 

Start  3 

m 

48 

94 

NR 

NR 

11 

67 

11 

75 

13  89 

n=  100 

■OIHB 

35 

83 

NR 

NR 

11 

77 

11 

89 

11  91 

GenR 

.25 

191 

365 

NR 

NR 

57 

684 

73 

1055 

60  1150 

start  2 

.1 

192 

410 

NR 

NR 

60 

775 

72 

1012 

62  1153 

n=  100 

.001 

188 

528 

Nr 

NR 

58 

782 

63 

1062 

60  1197 

Call 

.25 

423 

819 

NR 

NR 

10 

409 

9 

828 

26  3855 

Start  1 

.1 

429 

854 

NR 

NR 

10 

409 

8 

570 

23  3214 

O 

o 

II 

.001 

416 

905 

NR 

NR 

10 

372 

8 

602 

26  3948 

CaI2 

.25 

112 

204 

NR 

NR 

8 

117 

7 

172 

8  256 

Start  1 

.1 

107 

.206 

NR 

NR 

8 

117 

7 

172 

8  256 

.001 

113 

228 

NR 

NR 

8 

122 

8 

199 

8  259 

.25 

143 

270 

NR 

NR 

7 

159 

7 

196 

9  508 

142 

281 

NR 

NR 

7 

168 

7 

196 

9  636 

.001 

138 

284 

NR 

NR 

7 

176 

7 

195 

9  617 

Table  5C — Miscellaneous  smaller  functions.  Differencing  along  search  direction. 


Function 

»/ 

PLMA 

MNA 

TN 

PBTN 

BTN 

Cheb 

.25 

29 

121 

7 

53 

10 

87 

10 

104 

Start  2 

.1 

24 

116 

8 

68 

9 

105 

9 

96 

O 

11 

.33 

28 

92 

30 

161 

9 

90 

10 

129 

11 

164 

QOR 

14 

29 

il 

3 

3 

25 

5 

25 

5 

29 

Start  1 

14 

29 

11 

3 

3 

25 

5 

25 

5 

29 

n  =  50 

14 

29 

13 

27 

3 

3 

25 

5 

25 

5 

29 

GOR 

.25 

14 

71 

29 

5 

5 

7 

7 

77 

7 

74 

Start  1 

.1 

14 

76 

29 

5 

5 

7 

7 

77 

7 

74 

n  =  50 

Klin 

42 

97 

29 

72 

5 

7 

7 

7 

85 

6 

ChaR 

40 

82 

48 

97 

15 

28 

11 

77 

11 

121 

88 

Start  4 

•1 

37 

76 

46 

122 

16 

48 

11 

75 

11 

91 

10 

94 

n  =  25 

.001 

43 

119 

47 

164 

12 

47 

14 

126 

1 

4 

11 

104 

106 


Table  5D — Sparse  Rnite-diflerencing. 


Function 

V 

PLMA 

QNM 

MNA 

TN 

PBTN 

BTN 

GenRs 

.25 

108 

201 

128 

287 

62 

202 

31 

168 

42 

239 

33 

196 

SUrt  2 

.1 

119 

263 

118 

323 

66 

257 

33 

198 

36 

218 

32 

197 

n  =  50 

.001 

119 

330 

118 

412 

88 

392 

34 

245 

37 

258 

35 

252 

GenRs 

.25 

191 

365 

mgm 

NR 

57 

335 

73 

415 

60 

346 

SUrt  2 

.1 

192 

410 

Hffil 

NR 

60 

370 

72 

430 

62 

398 

n=  100 

.001 

188 

528 

■SI 

NR 

58 

422 

63 

474 

60 

441 

Calls 

.25 

194 

366 

162 

191 

7 

10 

78 

8 

63 

16 

120 

Start  1 

.1 

204 

401 

89 

214 

6 

9 

73 

8 

66 

.17 

130 

n  =  50 

.001 

205 

456 

88 

269 

6 

9 

88 

7 

72 

13 

117 

Calls 

.25 

423 

819 

NR 

NR 

9 

72 

26 

191 

Start  1 

.1 

429 

854 

NR 

NR 

85 

8 

68 

23 

178 

o 

o 

II 

s 

.001 

416 

905 

NR 

NR 

8 

79 

26 

227 

Cal28 

.25 

64 

106 

■1 

4 

■B 

50 

Hi 

11^ 

8 

57 

Start  1 

.1 

61 

118 

4 

Bl 

50 

8 

57 

n  =  50 

.001 

60 

123 

28 

67 

■d 

6 

HI 

52 

8 

59 

Cal2s 

.25 

112 

204 

NR 

NR 

8 

57 

7 

50 

8 

57 

start  1 

.1 

107 

206 

NR 

NR 

8 

57 

7 

50 

8 

57 

n=  100 

.001 

113 

228 

NR 

NR 

8 

59 

8 

58 

8 

59 

CalSs 

.25 

80 

152 

90 

114 

6 

6 

50 

7 

7 

Start  1 

.1 

78 

155 

59 

135 

5 

7 

50 

7 

52 

7 

m 

n  =  50 

.001 

77 

161 

•59 

161 

5 

11 

59 

7 

58 

8 

68 

Cal3s 

.25 

143 

270 

NR 

NR 

7 

7 

-  9 

57 

Start  1 

.1 

142 

281 

NR 

riR 

7 

51 

7 

52 

9 

65 

n=  100 

.001 

138 

284 

NR 

NR  ' 

7 

7 

9 

77 

QORs 

.25 

14 

29 

23 

39 

3 

3 

6 

55 

5 

46 

is 

Start  1 

.1 

14 

29 

13 

27 

3 

3 

6 

55 

5 

46 

n  =  50 

.001 

14 

29 

13 

27 

3 

3 

6 

55 

5 

46 

46 

GORs 

.25 

14 

71 

29 

59 

5 

5 

7 

7 

7 

Start  1 

.1 

14 

76 

29 

59 

5 

5 

7 

7 

7 

64 

n  =  50 

.001 

42 

29 

72 

5 

7 

7 

73 

7 

71 

6 

59 

ChaRs 

.25 

40 

82 

48 

97 

15 

28 

11 

56 

■B 

56 

51 

Start  4 

.1 

37 

76 

46 

122 

16 

48 

11 

58 

HI 

.60 

52 

n  =  25 

.001 

43 

119 

47 

164 

12 

47 

14 

98 

HI 

3 

70 

Table  6E — Totals.  To  compute  the  sparse  totals,  non-sparse  results  were  used  whenever  a  sparse  result  was 
not  available. 


Function 

PLMA 

QNM 

MNA 

TN 

PBTN 

BTN 

Totals,  sparse 

4773  10026 

NA 

NA 

654  4478 

684  4719 

749  5368 

Totals,  regular 

4773  10026 

NA 

NA 

654  7989 

684  10889 

749  24644 

107 
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Tables  6 — Comparison  of  various  values  of  t)  (.25,  .1,  .001,  .5,  .7,  .9)  for  the  best  truncated- Newton  routine. 
All  test  functions  are  used. 

Table  6A — Differencing  along  search  direction. 


Function 

n 

.25 

.1 

.001 

.5 

.7 

.9 

Penl 

50 

3 

19 

16 

41 

16 

41 

16 

41 

Penl 

100 

11 

2 

12 

16 

42 

16 

42 

16 

42 

Pen2 

50 

11 

64 

78 

12 

95 

18 

52 

18 

52 

18 

51 

Pen2 

100 

6 

29 

30 

6 

41 

10 

26 

10 

26 

10 

26 

Pens 

50 

10 

47 

9 

46 

9 

54 

13 

56 

13 

56 

13 

56 

Pens 

100 

11 

65 

11 

67 

11 

77 

14 

69 

14 

69 

14 

69 

GenR 

31 

330 

33 

348 

34 

395 

35 

420 

33 

336 

34 

360 

GenR 

57 

684 

60 

755 

58 

782 

60 

844 

60 

802 

59 

688 

Call 

10 

199 

9 

158 

9 

166 

9 

190 

9 

190 

9 

165 

Call 

bs 

10 

409 

10 

409 

10 

372 

10 

402 

11 

444 

11 

480 

Cal2 

EM 

7 

69 

7 

69 

7 

71 

7 

69 

7 

69 

7 

69 

Cal2 

100 

8 

117 

8 

117 

8 

122 

8 

117 

8 

117 

8 

117 

CalS 

50 

7 

112 

7 

112 

7 

113 

7 

112 

7 

112 

7 

112 

CalS 

100 

7 

159 

7 

168 

7 

176 

7 

159 

7 

159 

7 

159 

Cheb 

7 

53 

8 

68 

9 

90 

7 

53 

9 

70 

12 

119 

QOR 

6 

25 

6 

25 

6 

25 

6 

25 

6 

25 

6 

25 

GOR 

7 

60 

7 

61 

7 

67 

7 

60 

7 

60 

7 

60 

ChaR 

25 

11 

77 

11 

75 

14 

14 

82 

14 

82 

14 

82 

Totals 

215 

2539 

220 

2627 

219 

2803 

264 

2819 

265 

2752 

268 

2718 

Table  6B — Sparse  finite-differencing. 


Function 

n 

.25 

.1  • 

.001 

.5 

.7 

.9 

GenRs 

50 

31 

168 

33 

198 

34 

245 

35 

197 

33 

175 

34 

191 

GenRs 

100 

57 

335 

60 

370 

58 

422 

60 

344 

60 

348 

59 

327 

Calls 

50 

10 

78 

9 

73 

9 

.  88 

9 

65 

9 

65 

9 

65 

Calls 

100 

10 

79 

10 

85 

10 

95 

10 

73 

11 

81 

11 

80 

Cal2s 

50 

7 

50 

7 

50 

7 

52 

7 

50 

7 

50 

7 

50 

Cal2s 

100 

8 

57 

8 

57 

8 

59 

8 

57 

8 

57 

8 

57 

CalSs 

50 

7 

50 

•  7 

50 

7 

59 

7 

50 

7 

50 

7 

50 

Cal3s 

100 

7 

50 

7 

51 

7 

60 

7 

50 

7 

50 

7 

50 

QORs 

50 

6 

55 

6 

55 

6 

55 

6 

55 

6 

55 

6 

55 

GORs 

50 

7 

65 

7 

66 

7 

73 

7 

65 

7 

65 

7 

65 

ChaRs 

25 

11 

56 

11 

58 

14 

98 

14 

66 

14 

66 

14 

66 

Totals 

161 

1043 

165 

1113 

167 

1316 

170 

1072 

169 

1062 

169 

1056 

108 


Tables  7 — Comparison  of  truncated-Ncwtion  and  modified-Newton  algorithms,  ignoring  function/gradient 
evaluations  required  to  compute  the  matrix/vector  products  for  TN. 

Table  7A — Smaller  functions. 


«  Function 

MNA 

TN 

Penl 

.25 

17 

18 

7 

18 

St&rt  3 

.1 

9 

25 

7 

19 

n  =  50 

.001 

7 

29 

3 

15 

Pen2 

.25 

17 

17 

Stkrt  3 

.1 

9 

31 

n  =  50 

.001 

6 

26 

■la 

57 

Pens 

.25 

m 

BM 

SUrt  3 

.1 

9 

n  =  50 

.001 

■n 

tm 

9 

GenR 

.25 

62 

202 

31 

75 

Start  2 

.1 

66 

257 

33 

99 

n  =  50 

.001 

88 

392 

34 

143 

Call 

.25 

7 

10 

18 

St&rt  1 

.1 

6 

9 

19 

n  =  50 

.001 

6 

9 

34 

Cal2 

.25 

4 

4 

7 

8 

Start  1 

.1 

4 

4 

7 

8 

n  =  50 

.001 

4 

6 

7 

Cal3 

.25 

6 

6 

7 

8 

Start  1 

.1 

5 

7 

7 

8 

n  =  50 

.001 

5 

11 

7 

17 

Table  7B — Miscellaneous  smaller  functions  and  totals. 


Function 

MNA 

TN 

Cheb 

.25 

29 

121 

7 

16 

Start  2 

.1 

24 

116 

8 

17 

II 

.001 

30 

161 

9 

29 

QOR 

.25 

3 

3 

6 

7 

Start  1 

.1 

3 

3 

6 

7 

n  =  50 

.001 

3 

3 

6 

7 

GOR 

.25 

5 

5 

7 

9 

SUrt  1 

.1 

5 

5 

7 

10 

n  =  50 

.001 

5 

7 

7 

17 

ChaR 

.25 

15 

28 

11 

23 

Start  4 

.1 

16 

48 

11 

25 

n  =  25 

.001 

12 

47 

14 

56 

Totals 

541 

1753 

347 

910 

109 
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